Production ML Pipelines And System Design

What's being tested

Interviewers are probing whether you can design a production ML system that is trainable, deployable, measurable, and maintainable after launch. For a Machine Learning Engineer at Amazon, this means translating a model idea into a reliable pipeline: data contracts, feature computation, training, evaluation, deployment, monitoring, rollback, and retraining. The focus is not just “which model would you use,” but how you prevent offline/online skew, detect regressions, control latency and cost, and prove the model improves customer-facing outcomes. Strong answers show ownership of ML-specific infrastructure without drifting into raw storage, message-broker, or product-strategy design.

Core knowledge

Problem framing comes before architecture: define prediction target, inference mode, latency budget, freshness requirement, retraining cadence, failure tolerance, and evaluation metric. A fraud model, recommender, and LLM validator can all use pipelines, but their label delay, calibration needs, and serving constraints differ sharply.
Training pipelines should be reproducible: version the dataset snapshot, feature code, labels, model code, hyperparameters, container image, and random seed. In practice, this means using systems like SageMaker Pipelines, Kubeflow Pipelines, MLflow, or Airflow for orchestration and lineage.
Feature engineering must address offline/online parity. If training uses batch-computed aggregates but serving recomputes them differently, you get training-serving skew. A feature store such as SageMaker Feature Store, Feast, or Tecton helps by sharing transformation definitions and maintaining online/offline views.
Label design is often the hardest part. Define observation windows, prediction windows, censoring rules, and leakage boundaries. For example, if predicting conversion within 7 days, features must be computed strictly before exposure time $t$, and labels should be $y=\mathbb{1}[\text{conversion in }(t,\,t+7\text{d}]]$.
Model evaluation should combine discrimination, calibration, ranking, and slice metrics. Use AUC-ROC, PR-AUC, NDCG@K, MAP@K, log loss, or Brier score depending on the task. Calibration can be checked with expected calibration error: $ECE=\sum_b \frac{|B_b|}{n}\,|\text{acc}(B_b)-\text{conf}(B_b)|$, where $B_b$ is the set of predictions falling in confidence bin $b$ and $n$ is the total number of predictions.
Validation gates prevent bad models from deploying. Gates usually include schema checks, feature distribution checks, label leakage tests, offline metric thresholds, regression tests against a baseline, fairness/safety slices, latency benchmarks, and canary results. The pipeline should fail closed: if core validation data is missing, block promotion rather than skip the check.
Deployment patterns include batch scoring, synchronous real-time inference, asynchronous inference, shadow deployment, canary rollout, blue/green deployment, and A/B testing. For high-QPS services, latency targets should include p50, p95, and p99, not just average latency.
Serving architecture depends on model size and freshness. Tree models like XGBoost are often cheap to serve with millisecond latency; deep ranking models may need GPU or vector retrieval; LLM systems need prompt templates, retrieval context, safety filters, caching, token budgets, and fallback behavior.
Monitoring must separate data drift, prediction drift, concept drift, and performance decay. Useful signals include feature null rate, embedding norm distribution, population stability index (PSI), KL divergence, prediction score histograms, calibration by slice, latency, error rate, and business proxy metrics.
Retraining strategy should match label delay and drift speed. Common choices are scheduled retraining, performance-triggered retraining, active-learning loops, or human-in-the-loop review. Avoid automatic promotion from retraining alone; require evaluation gates and staged deployment.
LLM validation needs layered evaluation: prompt/unit tests, retrieval quality, factuality, toxicity, hallucination rate, jailbreak robustness, latency, cost per request, and human preference. Metrics can include exact match, semantic similarity, rubric-based judge scores, pairwise win rate, and safety violation rate.
Scalability choices should be ML-driven, not infrastructure theater. Use pandas or DuckDB for small offline experiments, distributed Spark or Ray for hundreds of millions of rows, and approximate nearest neighbor indexes like FAISS, ScaNN, or HNSW when embedding retrieval exceeds brute-force feasibility.
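The leakage boundary from the conversion example above can be made concrete with a small sketch. It assumes a hypothetical per-user schema (a list of timestamped feature events and a list of conversion timestamps); the function name `build_example` and the two aggregate features are illustrative only, and censoring rules are not handled here.

```python
from datetime import datetime, timedelta

def build_example(exposure_time, events, conversions, window=timedelta(days=7)):
    """Build one (features, label) pair at exposure time t.

    events: list of (timestamp, value) tuples (hypothetical schema).
    conversions: list of conversion timestamps for the same user.
    """
    # Features use only events strictly before t, so nothing at or after
    # the exposure can leak into training.
    past_values = [value for ts, value in events if ts < exposure_time]
    features = {"event_count": len(past_values), "event_sum": sum(past_values)}

    # Label is 1 if any conversion falls in the half-open window (t, t + 7d].
    window_end = exposure_time + window
    label = int(any(exposure_time < c <= window_end for c in conversions))
    return features, label

t = datetime(2024, 1, 10)
events = [
    (datetime(2024, 1, 8), 2.0),   # before t: used
    (datetime(2024, 1, 10), 5.0),  # exactly at t: excluded
    (datetime(2024, 1, 12), 1.0),  # after t: excluded
]
conversions = [datetime(2024, 1, 17)]  # exactly t + 7d: inside the window
features, label = build_example(t, events, conversions)
```

Running the same function at serving time, with only the events available at request time, is one way to keep the offline and online feature definitions aligned.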

Worked example

For the prompt “Build an end-to-end ML pipeline,” a strong candidate starts by clarifying: “What is the prediction target, how quickly do predictions need to be served, how delayed are labels, and what is the cost of false positives versus false negatives?” They would also state assumptions, such as “I’ll assume this is a supervised binary prediction problem with daily retraining, batch feature generation, and online low-latency inference.” The answer should be organized around four pillars: data and label definition, feature generation with offline/online parity, model training and evaluation, and deployment plus monitoring.

The candidate might describe a pipeline that validates input schemas, creates point-in-time-correct training examples, trains a baseline model such as XGBoost or a neural model depending on data shape, evaluates against offline metrics and business-aligned slices, then registers the model in a model registry. Deployment would use a canary or shadow stage before full rollout, with rollback to the previous model if p99 latency, error rate, or calibration error exceeds its threshold. One explicit tradeoff to flag is batch versus real-time features: batch features are simpler and cheaper but may be stale; real-time features improve freshness but increase serving complexity and skew risk. A strong close would be: “If I had more time, I’d discuss retraining triggers, human review for ambiguous labels, and how I’d run an online experiment before making the model the default.”
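The rollback condition in the canary stage can be sketched as a small check over canary metrics. This is a minimal illustration, not a prescribed implementation: the function names, the `limits` keys, and the threshold values are all assumptions, and the percentile uses the simple nearest-rank method.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def should_rollback(latencies_ms, errors, requests, calibration_error, limits):
    """Return the names of violated checks; a non-empty list means roll back."""
    violations = []
    if percentile(latencies_ms, 99) > limits["p99_latency_ms"]:
        violations.append("p99_latency")
    # With no traffic there is nothing to validate, so treat it as a failure.
    error_rate = errors / requests if requests else 1.0
    if error_rate > limits["error_rate"]:
        violations.append("error_rate")
    if calibration_error > limits["calibration_error"]:
        violations.append("calibration")
    return violations

# Hypothetical thresholds and canary measurements.
limits = {"p99_latency_ms": 100.0, "error_rate": 0.01, "calibration_error": 0.05}
latencies = [20.0] * 98 + [250.0, 300.0]  # fast on average, slow in the tail
violations = should_rollback(
    latencies, errors=3, requests=1000, calibration_error=0.02, limits=limits
)
```

Here the mean latency is well under the limit, but the tail is not, which is why the check is written against p99 rather than the average.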

A second angle

For the prompt “Design an LLM quality validation system,” the same production ML principles apply, but the artifacts and failure modes change. Instead of only validating numeric features and supervised labels, you validate prompts, retrieved context, generated text, safety filters, and model-version behavior across regression suites. Offline evaluation may combine curated golden sets, adversarial prompts, rubric-based human review, and LLM-as-judge scoring, while online monitoring tracks hallucination reports, refusal rate, latency, token cost, and safety violations. The key constraint is that LLM quality is multi-dimensional: a model can improve helpfulness while worsening factuality or policy compliance, so promotion needs multiple gates rather than a single aggregate score.
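The multiple-gates idea can be sketched as a per-metric comparison between a candidate and a baseline, where any single regression blocks promotion. The metric names, tolerances, and scores below are hypothetical and chosen only to show a helpfulness gain that hides a factuality regression.

```python
def promotion_decision(candidate, baseline, gates):
    """Promote only if every gated metric passes.

    gates maps metric name -> (direction, tolerance), where direction is
    "higher" (larger is better) or "lower" (smaller is better).
    """
    failures = []
    for metric, (direction, tolerance) in gates.items():
        if metric not in candidate or metric not in baseline:
            failures.append(metric)  # fail closed when a score is missing
            continue
        delta = candidate[metric] - baseline[metric]
        if direction == "higher":
            regressed = delta < -tolerance
        else:
            regressed = delta > tolerance
        if regressed:
            failures.append(metric)
    return len(failures) == 0, failures

# Hypothetical evaluation results.
baseline = {"helpfulness_win_rate": 0.50, "factuality": 0.92, "safety_violation_rate": 0.004}
candidate = {"helpfulness_win_rate": 0.58, "factuality": 0.88, "safety_violation_rate": 0.004}
gates = {
    "helpfulness_win_rate": ("higher", 0.0),
    "factuality": ("higher", 0.01),
    "safety_violation_rate": ("lower", 0.0),
}
promote, failed_gates = promotion_decision(candidate, baseline, gates)
```

An averaged score could rate this candidate as an improvement; the per-metric gates block it because factuality dropped by more than the allowed tolerance.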

Common pitfalls

Pitfall: Optimizing only for a headline metric like AUC.

A tempting answer is “train a model, compare AUC, deploy if it improves.” That misses calibration, threshold selection, slice regressions, latency, and online impact. A stronger answer maps metrics to the decision being automated: ranking needs NDCG@K, probability decisions need calibration, and customer-impacting launches need staged online validation.
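To make the calibration point concrete, here is a minimal sketch of expected calibration error for a binary classifier. It assumes the common variant that bins by predicted positive-class probability and compares the mean prediction in each bin with the observed positive rate; equal-width bins and `n_bins=10` are assumptions, not requirements.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """ECE = sum over bins of (bin size / n) * |observed rate - mean prediction|."""
    n = len(probs)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Equal-width bins on [0, 1]; p == 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(observed - mean_pred)
    return ece

# Toy data: the model says 0.9 but is right 75% of the time, and says 0.1
# when the true rate is 25%, so it is overconfident at both ends.
probs = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]
labels = [1, 1, 1, 0, 0, 0, 0, 1]
ece = expected_calibration_error(probs, labels)
```

This toy model ranks every positive-leaning example above every negative-leaning one, yet each bin is off by 0.15, which is the kind of gap a ranking-only metric does not surface.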

Pitfall: Starting with tools before requirements.

Saying “I’ll use SageMaker, Spark, Kafka, and Kubernetes” without clarifying target, label delay, freshness, and latency sounds shallow. Interviewers want the design to follow from constraints. Lead with the ML contract, then choose tools only where they solve a specific production problem.

Pitfall: Treating monitoring as dashboards only.

“Monitor drift” is not enough. Specify which distributions you monitor, what thresholds trigger alerts, who or what responds, and whether the response is rollback, retraining, feature disablement, or human review. Good monitoring connects symptoms to operational actions.
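One way to connect a drift signal to an action is sketched below using the population stability index over pre-binned counts. The 0.1 and 0.25 cutoffs are a commonly quoted rule of thumb rather than a standard, and the action names are hypothetical placeholders for whatever the on-call process defines.

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population stability index between two histograms over the same bins."""
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Floor the fractions so empty bins do not produce log(0).
        e_frac = max(e / expected_total, eps)
        a_frac = max(a / actual_total, eps)
        total += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return total

def drift_action(psi_value, warn=0.1, alert=0.25):
    """Map a PSI value to an operational response (thresholds are assumptions)."""
    if psi_value >= alert:
        return "page_oncall_and_consider_rollback"
    if psi_value >= warn:
        return "open_ticket_and_review"
    return "no_action"

# Hypothetical bin counts for one feature: training snapshot vs. live traffic.
training_counts = [250, 250, 250, 250]
live_counts = [100, 150, 250, 500]
feature_psi = psi(training_counts, live_counts)
action = drift_action(feature_psi)
```

The useful part in an interview answer is the second function: each threshold has a named owner and response, so the monitor produces a decision and not just a chart.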

Connections

Interviewers often pivot from this topic into feature stores, model evaluation and calibration, online experimentation, LLM safety evaluation, or recommender/ranking system design. They may also ask about adjacent storage or serving systems, but for an MLE answer, keep the focus on ML artifacts, model lifecycle, and customer-safe deployment.
