Production System Design Tradeoffs
Asked of: ML Engineer

What's being tested
These prompts probe an ML Engineer’s ability to design production ML systems that trade off latency, cost, and quality while maintaining safety, observability, and updateability. Interviewers expect clear framing (SLA, throughput, failure modes), a decomposition of system components (embeddings/index, reranker, generator, human-in-loop), and concrete operational strategies for monitoring, rollout, and remediation. The emphasis is on pragmatic engineering choices—what to measure, how to reduce p99 latency, when to accept approximate algorithms, and how to validate behavior in production.
Core knowledge
- Retrieval-augmented generation (RAG) architecture: separate retriever (fast, high-recall) and generator (slow, high-precision). Retrieval reduces generator input length and cost; reranking trades latency for accuracy.
- Embedding indexes and ANN methods: use HNSW for high accuracy and low latency at medium scale, IVF-PQ (inverted file index with product quantization) for memory compression above roughly 10M vectors, and OPQ, which learns a rotation before quantization, to reduce quantization error. Cite FAISS or Milvus as implementations.
- Vector storage sizing: 10M vectors × 1536 dims × 4 bytes ≈ 61 GB raw; quantization can reduce this 8–16×. Plan hosts and replication accordingly, and account for in-memory warm caches for p99.
- Latency decomposition: split end-to-end latency into per-stage terms, e.g. $L_{\text{total}} \approx L_{\text{retrieval}} + L_{\text{rerank}} + L_{\text{generation}} + L_{\text{overhead}}$. Optimize the largest contributors first; reduce the number of retrieval candidates k to cut reranker and generator cost.
- Reranker designs: a lightweight cross-encoder reranker for the top-100 candidates; dense dot-product scoring for candidate sets above 10k; or a cascade of cheap lexical filter → dense retriever → cross-encoder. Cross-encoders usually add latency but improve MRR.
- Cache and TTL strategies: cache hits reduce p99; use sharded Redis or a local LRU, and use the cache-hit probability $h$ to estimate savings: $\mathbb{E}[L] = h \cdot L_{\text{hit}} + (1 - h) \cdot L_{\text{miss}}$.
- Freshness vs. indexing cost: trade index rebuild frequency against staleness; consider incremental updates (append-only + background merge) for document churn. For high-churn sources, favor nearline embeddings and short-TTL retrieval caches.
- Safety and filtering: implement lightweight classifier pre-filters (binary fast model) and heavier human review for borderline cases; log all model decisions for audit and retraining. Monitor false negatives on safety-critical classes.
- Evaluation metrics: use recall@k, MRR, precision@k for retrieval; response latency (p50, p95, p99) and cost-per-query for infra; user-centric metrics like task success if available. Track offline vs. online metric drift.
- Online testing & rollout: prefer shadow testing and canary traffic, a gradual ramp behind feature flags, and automatic rollback on signal degradation. For model selection, run A/B tests sized with a statistical power calculation and a significance level (alpha) fixed in advance.
- Drift detection: measure embedding distribution shift (mean cosine distance), label distribution shifts, and sudden metric drops. Trigger retraining when shifts exceed a threshold or downstream metrics degrade.
- Auditability/versioning: record model versions, embedding schema, index build parameters, and deterministic seeding for rerankers; include request/response hashes and input document ids for later reproduction.
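The retrieval metrics named in the list above are simple to compute offline. The following is a minimal sketch, assuming each query comes with a ranked list of document ids and a set of labelled relevant ids; the function names and sample ids are illustrative and not taken from any particular library.

```python
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots occupied by relevant documents."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked, relevant) pairs, one per query."""
    scores = [reciprocal_rank(ranked, relevant) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


# Illustrative data: two queries with hand-labelled relevant ids.
RUNS = [
    (["d3", "d1", "d7", "d2"], {"d1", "d2"}),
    (["a", "b"], {"a"}),
]
```

Reporting recall@k at the k actually passed to the reranker or generator keeps the offline number tied to what the downstream stage can see.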
Worked example — Design a low-latency RAG system
First 30 seconds: clarify the SLA (e.g., p95 < 300ms, and whether that bounds time to first token or the full response), throughput (QPS), dataset size (10M vs. 100M docs), and acceptable cost per request. State assumptions: embeddings are 1536-d, retrieval candidates k=50, and the generator is a 6B LLM with ~150ms token latency.
Skeleton answer pillars: (1) retrieval stack (ANN index + sharded replica layout), (2) reranking cascade (light lexical → dense → cross-encoder on top-k), (3) generation and token budgeting (prompt truncation, response length caps), (4) caching and pre-warming (query-result and partial-generation caching), (5) monitoring and rollout (p99 latency, recall@k, shadow tests).
A concrete tradeoff: reduce k from 50 to 10 to cut reranker cost and generator prompt length at the expense of recall; compensate with a stronger retriever or synthetic augmentation. For latency, propose moving the cross-encoder off the critical path: serve an immediate response from the cheaper ranking as a safe fallback, then deliver a refined response once the asynchronous rerank completes.
Close by noting next steps: simulate QPS with realistic queries, profile per-component latencies, and, if there is more time, add adaptive-k retrieval and cost-based routing (cold vs. warm paths).
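A back-of-envelope check on the stated assumptions can be sketched as follows. The 1536-d float32 vectors, the 10M and 100M corpus sizes, the 300ms target, and the ~150ms generator figure come from the assumptions above; the other per-stage latencies, the cache numbers, and all function names are hypothetical placeholders. Adding per-stage figures gives a rough planning number, not a true percentile of the end-to-end distribution.

```python
GB = 10**9  # decimal gigabytes


def raw_index_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw (unquantized) vector storage in GB."""
    return num_vectors * dims * bytes_per_value / GB


def expected_latency_ms(hit_rate: float, hit_ms: float, miss_ms: float) -> float:
    """Hit-rate-weighted mean latency of a cached stage."""
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


def budget_report(stage_ms: dict, sla_ms: float) -> dict:
    """Sum per-stage latencies and report headroom and the largest stage."""
    total = sum(stage_ms.values())
    return {
        "total_ms": total,
        "headroom_ms": sla_ms - total,
        "largest_stage": max(stage_ms, key=stage_ms.get),
    }


# Hypothetical stage latencies in ms; only the 150 ms generator figure
# is taken from the stated assumptions.
STAGES_MS = {"retrieval": 20, "rerank": 40, "generation": 150, "overhead": 30}

print("10M vectors, raw GB:", raw_index_gb(10_000_000, 1536))
print("100M vectors, raw GB:", raw_index_gb(100_000_000, 1536))
print("budget:", budget_report(STAGES_MS, sla_ms=300))
print("mean latency at 30% cache hits:", expected_latency_ms(0.3, 5, 250))
```

The report makes the "optimize the largest contributor first" argument concrete: with these placeholder numbers generation dominates, so shaving retrieval alone would barely move the total.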
A second angle — Design a chatbot fallback for unknown questions
Frame: define "unknown" as out-of-distribution, low-confidence, or hallucination-prone. First clarify the SLA for fallback (graceful degradation vs. blocking). Core pillars: (1) uncertainty estimation (calibrated confidence from the generator plus retrieval hit signals), (2) fallback policy (ask a clarifying question, retrieve relevant docs, invoke a tool, or hand off to a human), (3) fast safety checks (binary classifiers or rule checks to block dangerous outputs), (4) human-in-the-loop for escalation and labeling.
Specific decision: prefer a conservative fallback (ask a clarifying question) when confidence is below a threshold, but in high-throughput systems allow a small percentage of low-confidence automated responses, gated by audit sampling. Operationalize by logging fallback triggers, collecting labeled examples for retraining, and measuring the reduction in bad responses over time.
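The fallback policy described above can be sketched as a small decision function. This is one possible shape under stated assumptions, not a reference design: the thresholds, the audit rate, the `clarify_attempts` escalation rule, and every name below are hypothetical, and the confidence and retrieval scores are assumed to be already calibrated to the range 0 to 1.

```python
import random
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Action(Enum):
    ANSWER = "answer"
    CLARIFY = "ask_clarifying_question"
    ESCALATE = "hand_off_to_human"
    BLOCK = "block"


@dataclass(frozen=True)
class Signals:
    confidence: float          # calibrated generator confidence in [0, 1]
    retrieval_score: float     # best retrieval hit score in [0, 1]
    safety_flagged: bool       # result of the fast safety check
    clarify_attempts: int = 0  # clarifying questions already asked


@dataclass(frozen=True)
class Decision:
    action: Action
    audit: bool   # True if the interaction should be queued for human review
    reason: str   # logged so fallback triggers can be analysed later


def decide(
    signals: Signals,
    conf_threshold: float = 0.6,       # hypothetical value
    retrieval_threshold: float = 0.5,  # hypothetical value
    audit_rate: float = 0.02,          # hypothetical share of low-confidence answers let through
    max_clarify: int = 1,
    rng: Callable[[], float] = random.random,
) -> Decision:
    # Safety checks run first and always win.
    if signals.safety_flagged:
        return Decision(Action.BLOCK, True, "safety_check")
    if (signals.confidence >= conf_threshold
            and signals.retrieval_score >= retrieval_threshold):
        return Decision(Action.ANSWER, False, "confident")
    # Low confidence: let a small sampled fraction through, flagged for audit.
    if rng() < audit_rate:
        return Decision(Action.ANSWER, True, "low_confidence_audit_sample")
    if signals.clarify_attempts >= max_clarify:
        return Decision(Action.ESCALATE, True, "clarification_exhausted")
    return Decision(Action.CLARIFY, False, "low_confidence")
```

Passing `rng` as a parameter keeps the sampling step deterministic in tests, and the `reason` field gives the logging hook needed to measure how often each fallback path fires.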
Common pitfalls
Pitfall: Optimizing the wrong metric. Focusing solely on offline NLL or embedding loss can miss user-facing outcomes such as task success or a rise in downstream human reviews. Always tie design choices to production metrics and iterate with online experiments.
Pitfall: Ignoring p99 and tail behavior. Proposing only median latency improvements fails when customer SLOs depend on tails. Profile and optimize for p95/p99 (e.g., avoid cold starts, warm model instances, and mitigate lock contention).
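A small numeric illustration of why the median can hide the tail: the sketch below uses the nearest-rank percentile definition on synthetic latencies. The sample values are invented for illustration (97 warm requests and 3 slow ones standing in for cold starts), and `percentile` is a hypothetical helper, not a library function.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Synthetic latencies in ms: 97 warm requests and 3 slow outliers.
LATENCIES_MS = [40.0] * 97 + [900.0] * 3

for p in (50, 95, 99):
    print(f"p{p} = {percentile(LATENCIES_MS, p)} ms")
```

With this data the median and p95 both sit at the warm-path value while p99 lands on the outliers, which is the situation the pitfall describes.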
Pitfall: Over-engineering rerankers. Choosing a large cross-encoder for every query without cascading leads to unacceptable latency and cost; propose cascades or async refinement as a pragmatic compromise.
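The cascade idea can be sketched as a generic staged ranker in which each stage scores only the survivors of the previous one. Everything here is a toy under stated assumptions: the three scorers are simple token-overlap stand-ins for a lexical filter, a dense retriever, and a cross-encoder, and `CountingScorer` exists only to show how few calls reach the most expensive stage.

```python
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[str, str], float]


def _tokens(text: str) -> List[str]:
    return text.lower().split()


def overlap_count(query: str, doc: str) -> float:
    """Stand-in for a cheap lexical filter: number of shared tokens."""
    return float(len(set(_tokens(query)) & set(_tokens(doc))))


def overlap_density(query: str, doc: str) -> float:
    """Stand-in for a mid-cost scorer: shared tokens per document token."""
    doc_tokens = _tokens(doc)
    return overlap_count(query, doc) / len(doc_tokens) if doc_tokens else 0.0


def query_coverage(query: str, doc: str) -> float:
    """Stand-in for the expensive reranker: share of query tokens covered."""
    query_tokens = set(_tokens(query))
    return overlap_count(query, doc) / len(query_tokens) if query_tokens else 0.0


class CountingScorer:
    """Wraps a scorer and counts how many (query, doc) pairs it scores."""

    def __init__(self, fn: Scorer) -> None:
        self.fn = fn
        self.calls = 0

    def __call__(self, query: str, doc: str) -> float:
        self.calls += 1
        return self.fn(query, doc)


def run_cascade(query: str, docs: Sequence[str],
                stages: Sequence[Tuple[Scorer, int]]) -> List[str]:
    """Apply (scorer, keep_n) stages in order, keeping the top keep_n each time."""
    survivors = list(docs)
    for score, keep_n in stages:
        survivors = sorted(survivors, key=lambda d: score(query, d), reverse=True)[:keep_n]
    return survivors


QUERY = "reset password email"
DOCS = [
    "how to reset your password by email",
    "password reset link expired",
    "change billing email address",
    "quarterly revenue report",
    "email notification settings",
    "office holiday schedule",
]

expensive = CountingScorer(query_coverage)
result = run_cascade(QUERY, DOCS, [(overlap_count, 4), (overlap_density, 2), (expensive, 1)])
print(result, "expensive calls:", expensive.calls)
```

In a real system the stand-ins would be replaced by the actual lexical, dense, and cross-encoder scorers; the structure that matters is that the expensive scorer runs on the small surviving set rather than on every candidate.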
Connections
Interviewers may pivot to adjacent topics: model training pipelines (how retrievers/rerankers are updated and validated), feature stores for user/context features used in rerankers, or data engineering questions about upstream event schemas and ingestion latency when discussing freshness.
Further reading
- FAISS (Facebook AI Similarity Search) GitHub: practical ANN implementations and guidance on IVF/PQ and HNSW.
- Efficient and Robust Approximate Nearest Neighbor Search (HNSW paper): foundational algorithm explaining performance/complexity tradeoffs.
Practice questions
- Design a regional surge pricing strategy (OpenAI · Machine Learning Engineer · Onsite · hard)
- Design a chatbot fallback for unknown questions (OpenAI · Machine Learning Engineer · Onsite · hard)
- Design a recommendation system end-to-end (OpenAI · Machine Learning Engineer · Onsite · hard)
- Design a search query autocomplete system (OpenAI · Machine Learning Engineer · Onsite · hard)
- Design an image/video near-duplicate detection system (OpenAI · Machine Learning Engineer · Onsite · hard)
- Design a harmful video content moderation system (OpenAI · Machine Learning Engineer · Onsite · hard)
- Design an AWS fine-tuning platform for LLMs (OpenAI · Machine Learning Engineer · Onsite · hard)
- Design an ML search system (OpenAI · Machine Learning Engineer · Technical Screen · hard)
- Design a production RAG system (OpenAI · Machine Learning Engineer · Technical Screen · hard)
- Design enterprise RAG search system (OpenAI · Machine Learning Engineer · Technical Screen · hard)
- Design an enterprise RAG system (OpenAI · Machine Learning Engineer · Technical Screen · hard)
- Design a low-latency RAG system (OpenAI · Machine Learning Engineer · Technical Screen · hard)
Related concepts
- Scalable Service And Distributed System Design (System Design)
- Scalable Distributed System Architecture (System Design)
- Distributed Systems Consistency And Low-Latency Design (System Design)
- Storage, Indexing, APIs, And Secure Execution (System Design)
- Scalable Backend Architecture And Data Modeling (System Design)
- Production ML Pipelines And System Design (ML System Design)