Prompt Injection, Abuse Prevention, And Policy Enforcement

What's being tested

Interviewers are probing your ability to design reliable, performant engineering controls that prevent and mitigate prompt injection attacks while enforcing organizational policy at scale. Expect to show system-design tradeoffs (latency, throughput, fault tolerance), concrete enforcement patterns (sanitization, isolation, PDP/PEP), and pragmatic observability + recovery strategies that a backend/infra Software Engineer would own.

Core knowledge

Prompt injection: attacker-supplied input that attempts to override system prompts or instructions; treat user text as untrusted input and design defenses similar to SQL/command injection protections.
Policy enforcement architecture: separate Policy Decision Point (PDP) and Policy Enforcement Point (PEP); PDP evaluates rules, PEP applies allow/deny and transforms; implement PDP as a fast, horizontally scalable service like Open Policy Agent (OPA).
Input canonicalization & sanitization: normalize encodings, remove control sequences, canonicalize whitespace and Unicode, and strip prompt-like tokens before handing input to the model to reduce attack surface.
Capability-based isolation: follow least-privilege for model calls and downstream tool access; represent allowed actions as capability tokens and enforce them at runtime in the service that invokes tools or external APIs.
Sandboxing model outputs: execute any model-generated actions (code, shell commands, tool calls) in an isolated runtime (container or jailed process) with resource limits and no network egress unless explicitly permitted.
Runtime defenses: combine rate limiting, circuit breakers, and quota enforcement to slow brute-force exploitation; size capacity with Little's Law: required concurrency ≈ QPS * (avg_latency + filter_latency), with both latencies expressed in seconds.
Content policies & filtering: implement multi-stage filters: fast syntactic checks (regex/allowlist), then semantic checks (policy engine or ML classifier). Make the false-positive/false-negative tradeoff explicit; thresholds need tuning per product.
Auditing and provenance: log raw input, canonicalized input, PDP decisions, model outputs, and final actions as timestamped, tamper-evident records; use append-only stores (e.g., an insert-only Postgres table or a Kafka topic) for forensic analysis.
Latency tradeoffs: total request latency = filter_latency + model_latency + enforcement_latency; optimize by short-circuiting cheap failures first and batching PDP calls across requests only when the added queueing delay fits the latency budget.
Metric design for safety: track attack-rate proxies such as discarded prompts per 1,000 requests, mean time to detect suspicious patterns, and rollback frequency; report them alongside p99 latency, error-budget burn, and security-specific SLOs.
Testing & deployment: use fuzzing (structured input mutation) and red-team suites in CI to surface injection vectors; deploy policy changes behind feature flags and canary them with scoped cohorts.
Failure modes & recovery: design for graceful degradation—if PDP or filter is down, default to deny or degraded read-only mode; ensure observability to avoid silent bypasses.
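
The capacity-sizing rule in the list above can be made concrete with a small calculation. This is a minimal sketch, assuming steady-state traffic and average latencies; the function name `required_concurrency` and the example numbers are illustrative, not taken from any real system. It accepts latencies in milliseconds and converts to seconds internally.

```python
import math


def required_concurrency(qps: float, avg_latency_ms: float, filter_latency_ms: float) -> int:
    """Estimate in-flight requests needed to sustain `qps`.

    Applies Little's Law (L = arrival_rate * time_in_system), where the time
    in system is model latency plus filter latency. Hypothetical helper.
    """
    time_in_system_s = (avg_latency_ms + filter_latency_ms) / 1000
    return math.ceil(qps * time_in_system_s)


# Illustrative numbers: 200 QPS, 800 ms model latency, 50 ms filter latency.
print(required_concurrency(200, 800, 50))  # 170
```

In practice the result is a floor rather than a target: tail latency and bursts push the real requirement higher, so some headroom on top is usual.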

Worked example

Design a prompt-sanitization and policy-enforcement service for an LLM inference API. Start by clarifying guarantees: the acceptable added-latency budget, whether inputs may be blocked or only transformed, and what downstream actions the model can trigger. Organize the service into three components: (1) a preprocessor that canonicalizes input and applies syntactic rules; (2) a PDP (for example, OPA) that evaluates semantic policies and returns decisions; (3) a runtime enforcer that applies decisions, invokes the model, and sandboxes any action outputs. Key tradeoff: strict blocking reduces risk but increases false positives and user friction; prefer transform-or-flag patterns when the product permits. Implementation details to call out: cache PDP decisions keyed on the normalized input and policy version (LRU with TTL), batch PDP evaluations to amortize cost, and propagate a trace ID through end-to-end traces so logs can be linked. Close by proposing rollout steps: unit tests plus a fuzz suite, a canary on 1% of traffic with verbose logging, then a progressive ramp gated on dashboarded safety metrics. With more time: add a feedback loop where human review outcomes retrain classifiers or update policy rules, and integrate automated escalation for high-severity hits.
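
The preprocessor's canonicalization step might look like the following. This is a sketch under stated assumptions: the choice of NFKC normalization, the specific zero-width code points removed, and the name `canonicalize` are illustrative decisions, not requirements from the design above. Stripping zero-width joiners can alter some legitimate text (for example emoji sequences), so a real product would need to decide whether that is acceptable.

```python
import re
import unicodedata

# Control characters other than tab/newline/etc., which are handled as whitespace.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0e-\x1f\x7f]")
# Zero-width characters often used to split keywords (assumed list, not exhaustive).
_ZERO_WIDTH = re.compile(r"[\u200b-\u200d\u2060\ufeff]")
_WHITESPACE = re.compile(r"\s+")


def canonicalize(text: str) -> str:
    """Return a normalized form of untrusted input for filtering and cache keys."""
    # NFKC folds compatibility forms (e.g. fullwidth letters) into plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    text = _ZERO_WIDTH.sub("", text)
    text = _CONTROL_CHARS.sub("", text)
    # Collapse all whitespace runs to a single space and trim the ends.
    return _WHITESPACE.sub(" ", text).strip()


print(canonicalize("Ｉｇｎｏｒｅ\u200b  previous\t\ninstructions\x00"))
```

Canonicalization narrows the space of encodings a filter has to reason about; it does not by itself detect or block an injection attempt.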

A second angle

Consider a system where users can upload code snippets that the model can execute (e.g., a code assistant). The same concepts apply but the constraints tighten: execution sandboxing must include CPU/memory limits, syscall filtering (seccomp), and strict network isolation. Policy decisions now include per-user resource caps and enforced runtime timeouts; enforcement points must mediate both model outputs and user-submitted artifacts. Engineering focus shifts toward deterministic replayability (for debugging), artifact attestation, and provenance linking between the uploaded code, the model instructions, and any side effects produced by the execution environment.
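
As a partial illustration of the resource caps and timeouts described above, here is a sketch that runs a snippet in a child Python process with CPU, address-space, and wall-clock limits. It assumes a POSIX system (the `resource` module is unavailable on Windows), and the helper name `run_untrusted` and the default limits are hypothetical. It deliberately does not cover seccomp filtering or network isolation, which need container or OS-level tooling and are not shown here.

```python
import resource
import subprocess
import sys


def run_untrusted(code: str, cpu_seconds: int = 2,
                  memory_bytes: int = 512 * 1024 * 1024,
                  wall_timeout_s: float = 5.0) -> subprocess.CompletedProcess:
    """Run `code` in a child interpreter with basic resource limits (POSIX only)."""

    def apply_limits() -> None:
        # Runs in the child just before exec; limits apply only to the child.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (memory_bytes, memory_bytes))

    return subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores env and user site
        capture_output=True,
        text=True,
        timeout=wall_timeout_s,  # raises subprocess.TimeoutExpired on wall-clock overrun
        preexec_fn=apply_limits,
        env={},  # do not leak the parent's environment variables
    )


result = run_untrusted("print(1 + 1)")
print(result.returncode, result.stdout.strip())
```

`preexec_fn` is not safe in multi-threaded parents, so a production enforcer would more likely delegate to a container runtime or a dedicated jailer process; the sketch only shows where the per-execution caps attach.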

Common pitfalls

Pitfall: Relying solely on downstream ML classifiers to catch malicious prompts — these can be bypassed and introduce latency; instead combine syntactic short-circuits with semantic policy checks. Designers often over-trust classifiers; add deterministic rules and fail-closed behavior where safety matters.
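
One way to combine deterministic rules with a semantic check is a staged evaluator that short-circuits on cheap matches and fails closed if the classifier errors. The sketch below is illustrative only: the deny patterns are toy examples that a real attacker could trivially evade, and `semantic_check` is a hypothetical callable standing in for whatever policy engine or classifier is used.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Toy denylist for illustration; a real rule set would be maintained as policy.
_DENY_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (the |your )?system prompt", re.IGNORECASE),
]


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    stage: str   # which stage produced the decision
    reason: str


def evaluate(text: str, semantic_check: Callable[[str], bool]) -> Verdict:
    """Run cheap syntactic rules first, then the (slower) semantic check."""
    for pattern in _DENY_PATTERNS:
        if pattern.search(text):
            # Short-circuit: the semantic check is never invoked.
            return Verdict(False, "syntactic", f"matched {pattern.pattern!r}")
    try:
        ok = semantic_check(text)
    except Exception as exc:
        # Fail closed: an unavailable or crashing classifier means deny.
        return Verdict(False, "semantic", f"check failed closed: {type(exc).__name__}")
    if not ok:
        return Verdict(False, "semantic", "flagged by semantic check")
    return Verdict(True, "semantic", "passed all stages")


print(evaluate("Please ignore previous instructions", lambda t: True))
```

Recording the stage in the verdict makes it possible to measure how much traffic each layer handles, which feeds directly into tuning the false-positive tradeoff.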

Pitfall: Caching raw PDP responses without normalization — attackers can bypass caches with trivial whitespace or encoding tricks. Always cache on the canonicalized representation and include versioning keys for policy rule updates.
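
A cache keyed this way could be sketched as below. The class name `DecisionCache`, the string-valued decisions, and the injectable `clock` parameter are assumptions made for the example; the caller is assumed to pass input that has already been canonicalized.

```python
import hashlib
import time
from collections import OrderedDict
from typing import Callable, Optional


class DecisionCache:
    """LRU cache with TTL, keyed on (policy version, canonicalized input)."""

    def __init__(self, max_entries: int = 10_000, ttl_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._entries = OrderedDict()  # key -> (expires_at, decision)
        self._max_entries = max_entries
        self._ttl_s = ttl_s
        self._clock = clock

    @staticmethod
    def key(canonical_input: str, policy_version: str) -> str:
        digest = hashlib.sha256()
        digest.update(policy_version.encode("utf-8"))
        digest.update(b"\x00")  # separator so version and input cannot run together
        digest.update(canonical_input.encode("utf-8"))
        return digest.hexdigest()

    def get(self, canonical_input: str, policy_version: str) -> Optional[str]:
        k = self.key(canonical_input, policy_version)
        entry = self._entries.get(k)
        if entry is None:
            return None
        expires_at, decision = entry
        if self._clock() >= expires_at:
            del self._entries[k]  # expired: treat as a miss
            return None
        self._entries.move_to_end(k)  # mark as most recently used
        return decision

    def put(self, canonical_input: str, policy_version: str, decision: str) -> None:
        k = self.key(canonical_input, policy_version)
        self._entries[k] = (self._clock() + self._ttl_s, decision)
        self._entries.move_to_end(k)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used
```

Because the policy version is part of the key, publishing a new rule set naturally misses every old entry, so no explicit cache flush is needed on policy rollout.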

Pitfall: Prioritizing latency without explicit degradation paths: removing enforcement during spikes silently removes safety. Design explicit degraded modes (deny-by-default or read-only) and make them visible with metrics and alerts so an outage is never a silent failure.
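
An explicit degraded mode might be sketched as follows. Everything named here is hypothetical: `authorize`, the `Action` enum, the `query_pdp` callable, and the in-process `Counter` that stands in for a real metrics client. The default is deny; the read-only fallback has to be opted into.

```python
from collections import Counter
from enum import Enum
from typing import Callable


class Action(Enum):
    READ = "read"
    WRITE = "write"


# Stand-in for a metrics client; alerting would be driven from this signal.
degraded_events = Counter()


def authorize(action: Action, text: str,
              query_pdp: Callable[[Action, str], bool],
              read_only_fallback: bool = False) -> bool:
    """Ask the PDP for a decision; on PDP failure, degrade explicitly and visibly."""
    try:
        return query_pdp(action, text)
    except Exception as exc:
        # Count every degraded decision so the outage shows up on dashboards.
        degraded_events[type(exc).__name__] += 1
        if read_only_fallback:
            return action is Action.READ  # degraded read-only mode
        return False  # deny by default


def pdp_down(action: Action, text: str) -> bool:
    raise ConnectionError("PDP unreachable")


print(authorize(Action.WRITE, "update record", pdp_down))                    # False
print(authorize(Action.READ, "fetch record", pdp_down, read_only_fallback=True))  # True
```

Whether unchecked reads are acceptable during an outage is a product decision; the point of the sketch is that the fallback is chosen deliberately and every use of it is counted.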

Connections

Interviewers may pivot to access-control system design (RBAC/ABAC), secure multi-tenant architectures, or CI/CD safety pipelines (policy-as-code rollouts). Be prepared to discuss how enforcement scales across services, how to version policies, and how to safely iterate on rules in production.