Production System Design Tradeoffs

What's being tested

These prompts probe an ML Engineer’s ability to design production ML systems that trade off latency, cost, and quality while maintaining safety, observability, and updateability. Interviewers expect clear framing (SLA, throughput, failure modes), a decomposition of system components (embeddings/index, reranker, generator, human-in-the-loop), and concrete operational strategies for monitoring, rollout, and remediation. The emphasis is on pragmatic engineering choices: what to measure, how to reduce p99 latency, when to accept approximate algorithms, and how to validate behavior in production.

Core knowledge

Retrieval-augmented generation (RAG) architecture: separate retriever (fast, high-recall) and generator (slow, high-precision). Retrieval reduces generator input length and cost; reranking trades latency for accuracy.
Embedding indexes and ANN methods: use HNSW for high recall at low latency at medium scale, IVF-PQ (inverted file index with product quantization) for memory compression at >10M vectors, and OPQ, which learns a rotation applied before product quantization, to reduce quantization error. Cite FAISS or Milvus as implementations.
Vector storage sizing: 10M vectors × 1536 dims × 4 bytes (float32) ≈ 61GB raw; quantization can reduce this by roughly 8–16×. Plan host count and replication accordingly, and account for in-memory warm caches when targeting p99.
Latency decomposition: $L_{total}=L_{network}+L_{retrieval}+L_{rerank}+L_{generation}+L_{postprocess}$. Optimize the largest contributors first; reduce the number of retrieval candidates k to cut reranker and generator cost.
Reranker designs: reserve a lightweight cross-encoder reranker for roughly the top-100 candidates and use dense dot-product scoring when the candidate set exceeds 10k. A typical cascade is cheap lexical filter → dense retriever → cross-encoder. Cross-encoders add latency but often improve MRR.
Cache and TTL strategies: cache hits reduce p99; use sharded Redis or local LRU with cache-hit probability $p_{hit}$ to estimate savings: $E[L]=p_{hit}L_{cache}+(1-p_{hit})L_{miss}.$
Freshness vs. indexing cost: trade frequency of index rebuilds vs. staleness; consider incremental updates (append-only + background merge) for document churn. For high-churn sources, favor nearline embeddings and short TTL retrieval caches.
Safety and filtering: implement lightweight classifier pre-filters (binary fast model) and heavier human review for borderline cases; log all model decisions for audit and retraining. Monitor false negatives on safety-critical classes.
Evaluation metrics: use recall@k, MRR, precision@k for retrieval; response latency (p50, p95, p99) and cost-per-query for infra; user-centric metrics like task success if available. Track offline vs. online metric drift.
Online testing & rollout: prefer shadow testing and canary traffic, a gradual ramp with feature-flag gating, and automatic rollback on signal degradation. For model selection, run A/B tests sized by a statistical power calculation at a fixed significance level (alpha).
Drift detection: measure embedding distribution shift (mean cosine distance), label distribution shifts, and sudden metric drops. Trigger retraining when shifts exceed threshold or downstream metrics degrade.
Auditability/versioning: record model versions, embedding schema, index build parameters, and deterministic seeding for rerankers; include request/response hashes and input document ids for later reproductions.
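
The sizing, latency-decomposition, and cache formulas above can be checked with a few lines of Python. This is a back-of-envelope sketch: the function names, the 16× compression ratio, the 30% hit rate, and every millisecond figure are illustrative assumptions, not measurements from a real system.

```python
def raw_index_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw size of a flat float32 vector store, ignoring index overhead."""
    return num_vectors * dims * bytes_per_value


def total_latency_ms(stages: dict[str, float]) -> float:
    """L_total as the sum of per-stage latencies (stages assumed sequential)."""
    return sum(stages.values())


def expected_latency_ms(p_hit: float, cache_ms: float, miss_ms: float) -> float:
    """E[L] = p_hit * L_cache + (1 - p_hit) * L_miss."""
    if not 0.0 <= p_hit <= 1.0:
        raise ValueError("p_hit must be in [0, 1]")
    return p_hit * cache_ms + (1.0 - p_hit) * miss_ms


# Storage: 10M vectors x 1536 dims x 4 bytes.
raw_gb = raw_index_bytes(10_000_000, 1536) / 1e9   # 61.44 GB
compressed_gb = raw_gb / 16                        # assumed 16x quantization

# Latency: hypothetical per-stage numbers for a cache miss.
miss_ms = total_latency_ms({
    "network": 10.0,
    "retrieval": 15.0,
    "rerank": 40.0,
    "generation": 150.0,
    "postprocess": 5.0,
})                                                  # 220.0 ms

# Caching: assumed 30% hit rate and a 2 ms cache lookup.
avg_ms = expected_latency_ms(p_hit=0.3, cache_ms=2.0, miss_ms=miss_ms)

print(f"raw={raw_gb:.2f} GB, compressed={compressed_gb:.2f} GB")
print(f"miss={miss_ms:.1f} ms, expected={avg_ms:.1f} ms")
```

Note that the expectation describes the average; a cache does not improve the tail unless the hit rate is high enough that misses fall beyond the percentile being measured.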

Worked example — Design a low-latency RAG system

First 30 seconds: clarify SLA (e.g., p95 < 300ms), throughput (qps), dataset size (10M vs 100M docs), and acceptable cost per request. State assumptions: embeddings are 1536-d, retrieval candidates k=50, and the generator is a 6B LLM with ~150ms latency to first token, so the 300ms target applies to the start of a streamed response rather than to full completion.

Skeleton answer pillars: (1) Retrieval stack (ANN index + sharded replica layout), (2) Reranking cascade (light lexical → dense → cross-encoder on top-k), (3) Generation and token budgeting (prompt truncation, response length caps), (4) Caching and pre-warming (query-result & partial generation caching), (5) Monitoring & rollout (p99 latency, recall@k, shadow tests). A concrete tradeoff: reduce k to 10 to cut generator prompt length and reranker cost at the expense of recall; compensate with a stronger retriever or synthetic augmentation. For latency, propose moving the cross-encoder to an asynchronous rerank, so the user gets an immediate safe fallback answer and a refined response later. Close by noting next steps: simulate QPS with realistic queries, profile per-component latencies, and, if time allows, add adaptive-k retrieval and cost-based routing (cold vs. warm paths).
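
The reranking cascade in pillar (2) can be sketched as three narrowing stages. This is a minimal illustration, not a production design: `lexical_overlap`, `cascade_rerank`, the toy scorers, and the sample documents are all hypothetical stand-ins. A real system would plug in BM25, an embedding model, and a cross-encoder model where the callables are.

```python
from typing import Callable

Doc = tuple[str, str]                 # (doc_id, text)
Scorer = Callable[[str, str], float]  # (query, text) -> relevance score


def lexical_overlap(query: str, text: str) -> float:
    """Cheap stage-1 score: fraction of query terms present in the text."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / len(q_terms) if q_terms else 0.0


def cascade_rerank(query: str, docs: list[Doc], dense_score: Scorer,
                   cross_score: Scorer, lexical_keep: int = 100,
                   dense_keep: int = 10) -> list[Doc]:
    """Lexical filter -> dense scoring -> expensive scorer on survivors only."""
    stage1 = sorted(docs, key=lambda d: lexical_overlap(query, d[1]),
                    reverse=True)[:lexical_keep]
    stage2 = sorted(stage1, key=lambda d: dense_score(query, d[1]),
                    reverse=True)[:dense_keep]
    # The costly model sees at most dense_keep documents.
    return sorted(stage2, key=lambda d: cross_score(query, d[1]), reverse=True)


# Toy stand-ins (assumptions) so the sketch runs without any model.
cross_calls = {"n": 0}


def toy_dense(query: str, text: str) -> float:
    return lexical_overlap(query, text)


def toy_cross(query: str, text: str) -> float:
    cross_calls["n"] += 1
    return lexical_overlap(query, text) + 1.0 / (1 + len(text.split()))


docs = [
    ("d1", "reset your password from the account settings page"),
    ("d2", "billing invoices are emailed monthly"),
    ("d3", "password rules require twelve characters"),
    ("d4", "contact support to close your account"),
]
ranked = cascade_rerank("reset password", docs, toy_dense, toy_cross,
                        lexical_keep=3, dense_keep=2)
print([doc_id for doc_id, _ in ranked], "cross-encoder calls:", cross_calls["n"])
```

The point of the structure is that the number of expensive scorer calls is bounded by `dense_keep` regardless of corpus size, which is what makes the reduction of k a direct latency and cost lever.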

A second angle — Design a chatbot fallback for unknown questions

Frame: define "unknown" — out-of-distribution, low-confidence, or hallucination-prone. First clarify SLA for fallback (graceful degradation vs. blocking). Core pillars: (1) uncertainty estimation (calibrated confidence from generator + retrieval hit signals), (2) fallback policy (clarifying question, retrieve relevant docs, invoke tool or hand-off to human), (3) fast safety checks (binary classifiers or rule checks to block dangerous outputs), (4) human-in-the-loop for escalation and labeling. Specific decision: prefer conservative fallback (ask clarifying question) when confidence < threshold, but allow a small percentage of low-confidence automated responses in high-throughput systems gated by audit sampling. Operationalize by logging fallback triggers, collecting labeled examples for retraining, and measuring reduction in bad responses over time.
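
One way to make the fallback policy concrete is a small routing function over the signals listed above. This is a sketch under stated assumptions: the `Signals` fields, the threshold values, the `max_clarify` limit, and the `Action` names are hypothetical, and the ordering of the checks is one reasonable choice rather than a prescribed one.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    BLOCK = "block"
    ANSWER = "answer"
    ANSWER_WITH_AUDIT = "answer_and_sample_for_audit"
    CLARIFY = "ask_clarifying_question"
    ESCALATE = "hand_off_to_human"


@dataclass(frozen=True)
class Signals:
    confidence: float           # calibrated generator confidence in [0, 1]
    top_retrieval_score: float  # best retrieval similarity in [0, 1]
    unsafe: bool                # verdict of the fast safety check


def choose_action(s: Signals, conf_threshold: float = 0.7,
                  retrieval_threshold: float = 0.5,
                  audit_sampled: bool = False,
                  clarify_attempts: int = 0,
                  max_clarify: int = 2) -> Action:
    """Route one request; thresholds here are illustrative, not tuned."""
    if s.unsafe:
        return Action.BLOCK  # safety check wins over everything else
    confident = (s.confidence >= conf_threshold
                 and s.top_retrieval_score >= retrieval_threshold)
    if confident:
        return Action.ANSWER
    if clarify_attempts >= max_clarify:
        return Action.ESCALATE  # stop looping on clarifying questions
    if audit_sampled:
        return Action.ANSWER_WITH_AUDIT  # small gated share of low-confidence answers
    return Action.CLARIFY  # conservative default


print(choose_action(Signals(0.9, 0.8, unsafe=False)))
print(choose_action(Signals(0.4, 0.8, unsafe=False)))
print(choose_action(Signals(0.4, 0.8, unsafe=False), clarify_attempts=2))
```

Logging the returned action alongside the input signals gives the fallback-trigger record and labeled examples that the operational plan calls for.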

Common pitfalls

Pitfall: Optimizing the wrong metric. Focusing solely on offline NLL or embedding loss can ignore user-facing metrics like task success or increased downstream human reviews. Always tie design choices to production metrics and iterate with online experiments.

Pitfall: Ignoring p99 and tail behavior. Proposing only median latency improvements fails when customer SLOs depend on tails. Profile and optimize for p95/p99 (e.g., avoid cold starts by keeping model instances warm, and mitigate lock contention).
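
A small synthetic example shows why the median hides tail problems. The latency values below are made up for illustration (97 fast requests and 3 slow ones, as a cold start might produce), and `percentile` is a hypothetical helper using the nearest-rank definition; monitoring systems may interpolate differently.

```python
import math
from statistics import mean


def percentile(samples: list[float], q: float) -> float:
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic data: 97 requests at 80 ms, 3 requests at 900 ms.
latencies_ms = [80.0] * 97 + [900.0] * 3

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

print(f"mean={mean(latencies_ms):.1f} p50={p50} p95={p95} p99={p99}")
```

In this sample p50 and p95 both sit at 80 ms while p99 is 900 ms, so a dashboard that tracks only the median or p95 would report a healthy system while 3% of requests badly exceed a 300 ms target.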

Pitfall: Over-engineering rerankers. Choosing a large cross-encoder for every query without cascading leads to unacceptable latency and cost; propose cascades or async refinement as a pragmatic compromise.

Connections

Interviewers may pivot to adjacent topics: model training pipelines (how retrievers/rerankers are updated and validated), feature stores for user/context features used in rerankers, or data engineering questions about upstream event schemas and ingestion latency when discussing freshness.
