Machine Learning System Design For Real-Time Decisions
Asked of: Data Scientist

What's being tested
Interviewers are probing whether you can design and evaluate machine learning systems for real-time decisions from a Data Scientist’s perspective: defining the prediction target, choosing features and models, validating offline, calibrating uncertainty, and deciding whether the model improves business outcomes. Uber cares because ETA, CTR, promo targeting, and real-time clustering all directly affect marketplace efficiency, rider trust, driver utilization, and spend efficiency. The emphasis is not on building serving infrastructure; it is on whether you understand how prediction quality, calibration, bias, feedback loops, delayed labels, and online evaluation interact in a production decision system. Strong answers connect model choices to decision metrics, not just offline accuracy.
Core knowledge
- Problem framing comes first: define the unit of prediction, decision point, label, and action. For ETA, the unit may be trip segment or full trip; for CTR, an impression; for promo targeting, a user-offer decision. Misdefining the label often causes bigger errors than choosing the wrong model.
- Real-time features should be described as signal sources, not pipeline architecture. For ETA, useful signals include origin/destination geohashes, time of day, day of week, route distance, historical speeds, weather, event indicators, driver state, and recent traffic. For CTR, use user history, ad/item metadata, context, position, and recency features.
- Time-aware validation is mandatory when data distributions shift. Use train/validation/test splits ordered by time rather than random splits, especially for ETA, CTR, and promo response. A random split can leak future traffic patterns, future user behavior, or post-treatment outcomes into training.
- Loss functions should match the decision. ETA is often trained or evaluated with MAE, RMSE, MAPE, and pinball loss for quantiles, plus calibration checks on prediction intervals. CTR uses log loss, AUC, PR-AUC, calibration error, and lift by score decile. Promo targeting should optimize incremental value (incremental conversion probability times margin, minus coupon cost) rather than raw conversion probability.
- Calibration matters when scores drive thresholds, budgets, or user-facing estimates. A CTR model with good AUC but poor calibration can overspend ad budget; an ETA model with biased underestimates damages rider trust. Use reliability curves, expected calibration error, isotonic regression, Platt scaling, and segment-level calibration checks.
- Uncertainty quantification is central for real-time decisions. ETA should often output intervals, not just a point estimate: “pickup in 4–6 minutes” may be better than “5 minutes.” Common approaches include quantile regression, conformal prediction, bootstrapped ensembles, and residual models by route/time segment.
- Model families should be justified by data shape and latency needs from an analytical perspective. XGBoost or LightGBM often work well for tabular CTR, ETA, and promo response baselines; neural networks can help with sparse high-cardinality features and embeddings; simple linear/logistic models remain valuable for interpretability and calibration baselines.
- Class imbalance is not solved by accuracy. CTR and promo conversion may have positive rates of 1–5% or lower, so accuracy can be meaningless. Prefer log loss, PR-AUC, lift, recall at fixed precision, calibration by score bucket, and business metrics such as revenue per impression or cost per incremental conversion.
- Delayed feedback creates biased labels. A user may click immediately, convert hours later, or complete a ride after dispatch. Define attribution windows, account for right-censoring, and compare mature-label evaluation versus early-label proxies. For promo targeting, delayed redemption and retention effects can change the apparent winner.
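- Causal evaluation is required when the model chooses interventions. Promo targeting is not just predicting who will use a coupon; it is estimating who changes behavior because of the coupon. Use randomized experiments when possible, or off-policy methods such as inverse propensity weighting, which reweights each logged outcome by the target policy's probability of the logged action divided by the logging policy's probability of that action.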
- Feedback loops appear when predictions influence future data. ETA affects rider cancellation, driver routing, and marketplace matching; CTR ranking affects what users see and therefore what labels are collected; promo targeting changes future user purchase behavior. Call out exploration, randomized holdouts, and monitoring by cohort to detect self-reinforcing bias.
- Segment-level evaluation is often where strong DS candidates stand out. Report aggregate metrics and slices: city, airport versus non-airport, peak versus off-peak, new versus returning users, long-tail routes, device type, cold-start users, and high-value cohorts. A model that improves global MAE but worsens airport pickup ETAs may be unacceptable.
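To make the inverse propensity weighting idea from the causal-evaluation point concrete, here is a minimal sketch. It assumes each logged record carries the action taken, the probability the logging policy gave that action, and an observed reward such as profit. The record layout, the toy values, and the `target_new_users_only` policy are all hypothetical.

```python
from typing import Callable, Sequence, Tuple

# Each logged record: (context, action taken, probability the logging policy
# assigned to that action, observed reward). This layout is an assumption.
LoggedRecord = Tuple[dict, int, float, float]


def ipw_value(
    logs: Sequence[LoggedRecord],
    target_prob: Callable[[dict, int], float],
) -> float:
    """Inverse propensity weighted estimate of the target policy's mean reward."""
    total = 0.0
    for context, action, logging_prob, reward in logs:
        if logging_prob <= 0.0:
            raise ValueError("logging propensity must be positive")
        # Reweight the outcome by how much more (or less) likely the target
        # policy is to take the logged action than the logging policy was.
        weight = target_prob(context, action) / logging_prob
        total += weight * reward
    return total / len(logs)


# Toy logs from a 50/50 randomized promo holdout: action 1 = send the offer.
logs = [
    ({"new_user": True}, 1, 0.5, 4.0),
    ({"new_user": True}, 0, 0.5, 1.0),
    ({"new_user": False}, 1, 0.5, 2.0),
    ({"new_user": False}, 0, 0.5, 3.0),
]


def target_new_users_only(context: dict, action: int) -> float:
    """Hypothetical deterministic policy: send the offer only to new users."""
    chosen = 1 if context["new_user"] else 0
    return 1.0 if action == chosen else 0.0


def always_send(context: dict, action: int) -> float:
    """Hypothetical baseline policy: send the offer to everyone."""
    return 1.0 if action == 1 else 0.0


estimate_targeted = ipw_value(logs, target_new_users_only)
estimate_always = ipw_value(logs, always_send)
```

The estimate is only as good as the logged propensities: small logging probabilities produce large weights and high variance, which is one reason randomized holdouts with known assignment probabilities are worth asking for.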
Worked example
For “Design ETA prediction for Uber rides,” a strong candidate would start by clarifying: “Are we predicting time to pickup, time to destination, or total trip duration? Is this pre-dispatch, post-dispatch, or continuously updated during the ride? What is the product goal: reduce absolute error, reduce underestimation, or improve cancellation and trust?” Then they would state assumptions, such as predicting pickup ETA at request time for rider-facing display.
The answer skeleton should have four pillars: label definition, feature design, model/evaluation, and monitoring/experimentation. For labels, use actual elapsed time between request and pickup, with careful treatment of cancellations and reassignment. For features, mention geospatial origin/destination, route distance, time-of-week, historical speed, driver proximity, recent traffic, weather, and event indicators. For modeling, propose a strong tabular baseline such as LightGBM, compare against route-segment historical averages, and consider quantile regression for uncertainty intervals.
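Quantile regression is trained with pinball loss, so it helps to be able to write that loss down. The sketch below is a plain-Python illustration with made-up pickup times, not a training loop: it shows that at a high quantile level, a prediction that sits below most outcomes is charged more than one that sits above them.

```python
from typing import Sequence


def pinball_loss(y_true: Sequence[float], y_pred: Sequence[float], q: float) -> float:
    """Mean pinball (quantile) loss at quantile level q in (0, 1)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must be strictly between 0 and 1")
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        diff = actual - predicted
        # Under-prediction (diff > 0) is charged q per unit,
        # over-prediction is charged (1 - q) per unit.
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)


# Toy pickup times in minutes (made-up values).
actual_minutes = [3.0, 4.0, 5.0, 6.0, 12.0]

# Compare two constant predictions at the 90th percentile.
loss_at_4 = pinball_loss(actual_minutes, [4.0] * 5, q=0.9)
loss_at_12 = pinball_loss(actual_minutes, [12.0] * 5, q=0.9)
```

Fitting one model at a low quantile and one at a high quantile gives a lower and an upper bound for the displayed interval. Whether that interval covers the stated share of trips still has to be checked on held-out data.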
A specific tradeoff to flag is the asymmetric cost of under- versus overestimating a user-facing ETA: underestimating may increase rider frustration and cancellations, while overestimating may reduce conversion even though the driver would have arrived sooner. Therefore, do not optimize only global MAE; evaluate signed error, percent within one or two minutes, calibration of intervals, and underprediction rate by segment. The online test should track rider cancellation, driver wait time, completed trips, support contacts, and marketplace guardrails. A strong close would be: “If I had more time, I’d add continuous updating during the trip, conformal intervals by city/time segment, and a randomized experiment comparing rider trust and marketplace outcomes, not just prediction error.”
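The segment-level ETA metrics named above can be computed in a few lines. This is a minimal sketch that assumes rows of (segment, predicted minutes, actual minutes); the segment names and values are toy data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (segment, predicted minutes, actual minutes); toy values, not real data.
Row = Tuple[str, float, float]


def eta_report(rows: Iterable[Row], tolerance_min: float = 2.0) -> Dict[str, Dict[str, float]]:
    """Per-segment MAE, signed bias, underprediction rate, and within-tolerance share."""
    grouped = defaultdict(list)
    for segment, predicted, actual in rows:
        # Signed error: negative means the ETA was an underestimate.
        grouped[segment].append(predicted - actual)

    report = {}
    for segment, errors in grouped.items():
        n = len(errors)
        report[segment] = {
            "mae": sum(abs(e) for e in errors) / n,
            "mean_signed_error": sum(errors) / n,
            "underprediction_rate": sum(e < 0 for e in errors) / n,
            "within_tolerance": sum(abs(e) <= tolerance_min for e in errors) / n,
        }
    return report


rows = [
    ("airport", 5.0, 9.0),
    ("airport", 6.0, 7.0),
    ("downtown", 4.0, 3.0),
    ("downtown", 5.0, 5.0),
]
report = eta_report(rows)
```

In this toy data the pooled MAE is 1.5 minutes, which hides the fact that every airport ETA is an underestimate.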
A second angle
For “Select the better $5 promo-targeting model,” the comparison turns on which users should receive the $5 offer. Validation must account for treatment assignment, budget constraints, delayed redemption, and possible leakage from post-offer behavior. Instead of ranking by conversion probability alone, rank by expected incremental profit: incremental conversion probability times margin minus coupon cost. The better model may have slightly worse AUC for redemption yet still win if it avoids giving discounts to users who would have converted anyway.
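A small sketch of ranking by expected incremental profit follows. It assumes an uplift-style model that outputs two conversion probabilities per user, with and without the offer. It also assumes the coupon is only paid out when a treated user converts, which is a refinement of the flat "minus coupon cost" form. The user names, the $20 margin, and the probabilities are made up.

```python
from typing import List, NamedTuple


class UserScore(NamedTuple):
    user_id: str
    p_convert_with_offer: float     # modeled P(convert | offer)
    p_convert_without_offer: float  # modeled P(convert | no offer)


def expected_incremental_profit(score: UserScore, margin: float, coupon_cost: float) -> float:
    """Uplift times margin, minus the expected coupon payout."""
    uplift = score.p_convert_with_offer - score.p_convert_without_offer
    # Assumption: the coupon is only redeemed when a treated user converts.
    return uplift * margin - score.p_convert_with_offer * coupon_cost


def rank_for_offer(scores: List[UserScore], margin: float, coupon_cost: float) -> List[str]:
    """Return user ids with positive expected incremental profit, best first."""
    profits = [(expected_incremental_profit(s, margin, coupon_cost), s.user_id) for s in scores]
    profitable = [(profit, user_id) for profit, user_id in profits if profit > 0]
    return [user_id for _, user_id in sorted(profitable, reverse=True)]


scores = [
    UserScore("loyal", 0.90, 0.88),        # converts anyway: high redemption, little uplift
    UserScore("persuadable", 0.40, 0.10),  # lower conversion, large uplift
    UserScore("unlikely", 0.05, 0.04),
]
targets = rank_for_offer(scores, margin=20.0, coupon_cost=5.0)
```

A redemption-probability ranking would put the "loyal" user first; the incremental-profit ranking excludes that user because almost all of the coupon spend would subsidize a purchase that was going to happen anyway.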
Common pitfalls
Pitfall: Optimizing the wrong metric.
A common analytical mistake is saying “I would choose the model with the lowest RMSE” or “highest AUC” without connecting it to the decision. For ETA, signed bias and interval calibration may matter more than small RMSE gains; for promo targeting, incremental profit matters more than raw conversion prediction. A better answer names the offline metric, the business metric, and the guardrails.
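One way to show why "highest AUC" is not enough: two score sets with identical ranking have identical AUC, but very different calibration. The sketch below uses plain-Python versions of AUC and equal-width-bin expected calibration error on toy labels. The `inflated` scores are a hypothetical overconfident model that preserves the ranking.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability a random positive outscores a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def expected_calibration_error(labels: Sequence[int], scores: Sequence[float], n_bins: int = 5) -> float:
    """Equal-width-bin ECE: weighted mean of |observed rate - mean score| per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, s in zip(labels, scores):
        index = min(int(s * n_bins), n_bins - 1)  # clamp s == 1.0 into the last bin
        bins[index].append((y, s))
    ece = 0.0
    for members in bins:
        if members:
            observed = sum(y for y, _ in members) / len(members)
            predicted = sum(s for _, s in members) / len(members)
            ece += len(members) / len(labels) * abs(observed - predicted)
    return ece


# Toy data: five impressions scored low (1 click), five scored high (4 clicks).
labels = [1, 0, 0, 0, 0, 1, 1, 1, 1, 0]
calibrated = [0.2] * 5 + [0.8] * 5
inflated = [0.6] * 5 + [0.95] * 5  # same ordering, overconfident probabilities

auc_calibrated = auc(labels, calibrated)
auc_inflated = auc(labels, inflated)
ece_calibrated = expected_calibration_error(labels, calibrated)
ece_inflated = expected_calibration_error(labels, inflated)
```

Both score sets rank impressions identically, so a ranking metric cannot tell them apart. A bidding or budgeting rule that multiplies the score by a value per click would overspend on the inflated one.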
Pitfall: Treating real-time ML design as infrastructure design.
Some candidates drift into Kafka, feature store replication, service latency, or retry mechanics. For a Data Scientist interview, keep those as assumptions and focus on signal quality, label definition, validation, model behavior, calibration, and experiment design. You can say “assuming these features are available at decision time,” then analyze whether they are predictive, leaky, stable, and fair across segments.
Pitfall: Ignoring leakage and delayed labels.
Wrong-but-tempting answers use features such as final route duration, post-click engagement, redemption status, or completed-trip attributes that would not exist at prediction time. The stronger move is to explicitly separate pre-decision features from post-outcome data, use time-based splits, and define label maturity windows. This shows production judgment without needing to design the data pipeline.
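A time-based split with a label maturity window, plus an explicit allow-list of pre-decision features, can be sketched as follows. The field names (`decided_at`, `hour_of_week`, `route_distance_km`, `final_route_minutes`), the dates, and the three-day maturity window are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List, Tuple

Event = Dict[str, Any]

# Hypothetical allow-list: only fields known at decision time may be features.
PRE_DECISION_FEATURES = {"hour_of_week", "route_distance_km"}


def time_split(
    events: List[Event], train_end: datetime, test_end: datetime, maturity: timedelta
) -> Tuple[List[Event], List[Event]]:
    """Split by decision time; train only on rows whose label matured before train_end."""
    train, test = [], []
    for event in events:
        decided_at = event["decided_at"]
        if decided_at + maturity <= train_end:
            train.append(event)
        elif train_end <= decided_at < test_end:
            test.append(event)
        # Rows decided before train_end with immature labels are dropped.
    return train, test


def features_only(event: Event) -> Event:
    """Strip post-outcome fields so they cannot leak into the model input."""
    return {k: v for k, v in event.items() if k in PRE_DECISION_FEATURES}


events = [
    {"decided_at": datetime(2024, 1, 1), "hour_of_week": 10, "route_distance_km": 3.2, "final_route_minutes": 14.0},
    {"decided_at": datetime(2024, 1, 6), "hour_of_week": 130, "route_distance_km": 8.0, "final_route_minutes": 25.0},
    {"decided_at": datetime(2024, 1, 8), "hour_of_week": 9, "route_distance_km": 1.1, "final_route_minutes": 6.0},
    {"decided_at": datetime(2024, 1, 12), "hour_of_week": 100, "route_distance_km": 5.5, "final_route_minutes": 18.0},
]
train, test = time_split(
    events, train_end=datetime(2024, 1, 8), test_end=datetime(2024, 1, 15), maturity=timedelta(days=3)
)
train_inputs = [features_only(e) for e in train]
```

The January 6 event is intentionally excluded from both sets: it was decided before the training cutoff, but its label had not matured by then. The test rows are assumed to be scored only after their own labels mature.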
Connections
Interviewers may pivot from here into experimentation, especially A/B test design, guardrail metrics, heterogeneous treatment effects, and launch decisions. They may also probe causal inference, ranking evaluation, calibration, anomaly diagnosis, or model monitoring by cohort. For Uber-specific contexts, expect follow-ups on marketplace metrics such as cancellations, completed trips, rider wait time, driver utilization, and promo budget efficiency.
Further reading
- “Hidden Technical Debt in Machine Learning Systems”: classic paper on production ML failure modes, including feedback loops and monitoring complexity.
- “Practical Lessons from Predicting Clicks on Ads at Facebook”: useful for CTR modeling, calibration, high-cardinality sparse features, and online/offline metric gaps.
- “A Survey of Methods for Explaining Black Box Models”: helpful background for discussing interpretability and trust in high-impact prediction systems.
Practice questions
- Implement Streaming Clustering for Numbers · Uber · Data Scientist · Onsite · none
- Design a Ride-Hailing ETA System · Uber · Data Scientist · Technical Screen · medium
- Design ETA prediction for Uber rides · Uber · Data Scientist · Technical Screen · hard
- Select the better $5 promo-targeting model · Uber · Data Scientist · Technical Screen · hard
- Build and assess CTR prediction · Uber · Data Scientist · Technical Screen · hard
- Optimize Surge Notifications for Rideshare Drivers · Uber · Data Scientist · Technical Screen · hard
Related concepts
- Applied Machine Learning Modeling And Evaluation · Machine Learning
- ML System Design, Recommenders, Forecasting And Allocation · Machine Learning
- Machine Learning Model Design And Evaluation · Machine Learning
- Predictive Modeling For Delivery And Marketplace Decisions · Machine Learning
- Recommender, Ranking, And Ads ML Systems
- Machine Learning Project Lifecycle · Machine Learning