Machine Learning System Design For Real-Time Decisions

What's being tested

Interviewers are probing whether you can design and evaluate machine learning systems for real-time decisions from a Data Scientist’s perspective: defining the prediction target, choosing features and models, validating offline, calibrating uncertainty, and deciding whether the model improves business outcomes. Uber cares because ETA, CTR, promo targeting, and real-time clustering all directly affect marketplace efficiency, rider trust, driver utilization, and spend efficiency. The emphasis is not on building serving infrastructure; it is on whether you understand how prediction quality, calibration, bias, feedback loops, delayed labels, and online evaluation interact in a production decision system. Strong answers connect model choices to decision metrics, not just offline accuracy.

Core knowledge

Problem framing comes first: define the unit of prediction, decision point, label, and action. For ETA, the unit may be trip segment or full trip; for CTR, an impression; for promo targeting, a user-offer decision. Misdefining the label often causes bigger errors than choosing the wrong model.
Real-time features should be described as signal sources, not pipeline architecture. For ETA, useful signals include origin/destination geohashes, time of day, day of week, route distance, historical speeds, weather, event indicators, driver state, and recent traffic. For CTR, use user history, ad/item metadata, context, position, and recency features.
Time-aware validation is mandatory when data distributions shift. Use train/validation/test splits ordered by time rather than random splits, especially for ETA, CTR, and promo response. A random split can leak future traffic patterns, future user behavior, or post-treatment outcomes into training.
Loss functions and evaluation metrics should match the decision. ETA models are usually trained and scored with MAE, RMSE, or MAPE for point estimates and pinball loss for quantiles, with prediction-interval coverage checked separately. CTR uses log loss, AUC, PR-AUC, calibration error, and lift by score decile. Promo targeting should optimize incremental value: $E[\text{profit}] = p(\text{incremental conversion}) \cdot \text{margin} - \text{promo cost}$.
Calibration matters when scores drive thresholds, budgets, or user-facing estimates. A CTR model with good AUC but poor calibration can overspend ad budget; an ETA model that systematically underestimates damages rider trust. Use reliability curves, expected calibration error, isotonic regression, Platt scaling, and segment-level calibration checks.
Uncertainty quantification is central for real-time decisions. ETA should often output intervals, not just a point estimate: “pickup in 4–6 minutes” may be better than “5 minutes.” Common approaches include quantile regression, conformal prediction, bootstrapped ensembles, and residual models by route/time segment.
Model families should be justified by data shape and latency needs. XGBoost or LightGBM often work well for tabular CTR, ETA, and promo response baselines; neural networks can help with sparse high-cardinality features and embeddings; simple linear/logistic models remain valuable for interpretability and calibration baselines.
Class imbalance is not solved by accuracy. CTR and promo conversion often have positive rates of 1–5% or lower, so a model that always predicts the negative class can score above 95% accuracy while being useless. Prefer log loss, PR-AUC, lift, recall at fixed precision, calibration by score bucket, and business metrics such as revenue per impression or cost per incremental conversion.
Delayed feedback creates biased labels. A user may click immediately, convert hours later, or complete a ride after dispatch. Define attribution windows, account for right-censoring, and compare mature-label evaluation versus early-label proxies. For promo targeting, delayed redemption and retention effects can change the apparent winner.
Causal evaluation is required when the model chooses interventions. Promo targeting is not just predicting who will use a coupon; it is estimating who changes behavior because of the coupon. Use randomized experiments when possible, or off-policy methods such as inverse propensity weighting: $\hat{V}_{IPS} = \frac{1}{n}\sum_i \frac{\mathbb{1}(a_i=\pi(x_i))r_i}{p(a_i|x_i)}$, where $\pi$ is the candidate policy, $a_i$ and $r_i$ are the logged action and observed reward, and $p(a_i|x_i)$ is the probability that the logging policy took that action.
Feedback loops appear when predictions influence future data. ETA affects rider cancellation, driver routing, and marketplace matching; CTR ranking affects what users see and therefore what labels are collected; promo targeting changes future user purchase behavior. Call out exploration, randomized holdouts, and monitoring by cohort to detect self-reinforcing bias.
Segment-level evaluation is often where strong DS candidates stand out. Report aggregate metrics and slices: city, airport versus non-airport, peak versus off-peak, new versus returning users, long-tail routes, device type, cold-start users, and high-value cohorts. A model that improves global MAE but worsens airport pickup ETAs may be unacceptable.
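
As a minimal sketch of the inverse propensity weighting estimate from the causal-evaluation point above, the following pure-Python function computes $\hat{V}_{IPS}$ for a deterministic candidate policy. The record layout, the `trips_last_30d` field, the toy policy, and all numbers are assumptions made for illustration, not a real logging schema.

```python
from typing import Callable, Sequence, Tuple

# Each logged record: (context, action taken, observed reward, logging propensity).
# This layout is hypothetical and chosen only for the example.
LoggedRecord = Tuple[dict, str, float, float]


def ips_value(logs: Sequence[LoggedRecord], policy: Callable[[dict], str]) -> float:
    """Inverse propensity weighted estimate of a deterministic policy's mean reward."""
    if not logs:
        raise ValueError("logs must not be empty")
    total = 0.0
    for context, action, reward, propensity in logs:
        if propensity <= 0.0:
            raise ValueError("logging propensity must be positive")
        # Only records where the logged action matches the candidate policy contribute.
        if policy(context) == action:
            total += reward / propensity
    return total / len(logs)


# Hypothetical logs from a randomized policy that sent a promo with probability 0.5.
logs = [
    ({"trips_last_30d": 0}, "promo", 4.0, 0.5),
    ({"trips_last_30d": 0}, "no_promo", 0.0, 0.5),
    ({"trips_last_30d": 9}, "promo", 1.0, 0.5),
    ({"trips_last_30d": 9}, "no_promo", 6.0, 0.5),
]


def target_inactive_users(context: dict) -> str:
    """Toy candidate policy: send the promo only to users with no recent trips."""
    return "promo" if context["trips_last_30d"] == 0 else "no_promo"


estimate = ips_value(logs, target_inactive_users)
print(f"IPS estimate of mean reward per user: {estimate:.2f}")
```

The estimate is only as trustworthy as the logged propensities: very small values of $p(a_i|x_i)$ inflate variance, which is one reason randomized holdouts with known assignment probabilities are preferred when available.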

Worked example

For **Design ETA prediction for Uber rides**, a strong candidate would start by clarifying: “Are we predicting time to pickup, time to destination, or total trip duration? Is this pre-dispatch, post-dispatch, or continuously updated during the ride? What is the product goal: reduce absolute error, reduce underestimation, or improve cancellation and trust?” Then they would state assumptions, such as predicting pickup ETA at request time for rider-facing display.

The answer skeleton should have four pillars: label definition, feature design, model/evaluation, and monitoring/experimentation. For labels, use actual elapsed time between request and pickup, with careful treatment of cancellations and reassignment. For features, mention geospatial origin/destination, route distance, time-of-week, historical speed, driver proximity, recent traffic, weather, and event indicators. For modeling, propose a strong tabular baseline such as LightGBM, compare against route-segment historical averages, and consider quantile regression for uncertainty intervals.
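
To make the quantile-regression option concrete, here is a small sketch of how predicted quantiles could be scored offline: pinball loss for a single quantile, plus empirical coverage of the interval between two quantiles. The pickup times and predicted percentiles are invented for illustration; in practice the quantile predictions would come from whichever model is under evaluation.

```python
from typing import Sequence


def pinball_loss(actual: Sequence[float], predicted: Sequence[float], quantile: float) -> float:
    """Mean pinball (quantile) loss; lower is better for the given quantile."""
    if not 0.0 < quantile < 1.0:
        raise ValueError("quantile must be strictly between 0 and 1")
    total = 0.0
    for y, q_hat in zip(actual, predicted):
        diff = y - q_hat
        # Actuals above the predicted quantile are weighted by `quantile`,
        # actuals below it by `1 - quantile`.
        total += quantile * diff if diff >= 0 else (quantile - 1.0) * diff
    return total / len(actual)


def interval_coverage(actual: Sequence[float], lower: Sequence[float], upper: Sequence[float]) -> float:
    """Share of actual values that fall inside [lower, upper]."""
    hits = sum(1 for y, lo, hi in zip(actual, lower, upper) if lo <= y <= hi)
    return hits / len(actual)


# Hypothetical pickup times (minutes) with predicted 10th and 90th percentiles.
actual_minutes = [4.0, 6.5, 3.0, 9.0, 5.0]
p10 = [3.0, 4.0, 2.5, 5.0, 4.5]
p90 = [6.0, 7.0, 5.0, 8.0, 7.0]

loss_p90 = pinball_loss(actual_minutes, p90, 0.9)
coverage = interval_coverage(actual_minutes, p10, p90)
print(f"pinball loss at q=0.9: {loss_p90:.3f}, P10-P90 coverage: {coverage:.0%}")
```

A P10 to P90 interval is nominally an 80% interval, so coverage well below 80% on held-out data suggests the displayed range is too narrow, and coverage well above it suggests the range is wider than it needs to be.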

A specific tradeoff to flag is the asymmetric cost of signed error in user-facing ETA: underestimating may increase rider frustration and cancellations, while overestimating may reduce conversion even if the driver would have arrived sooner. Therefore, do not optimize only global MAE; evaluate signed error, percent within one or two minutes, calibration of intervals, and underprediction rate by segment. The online test should track rider cancellation, driver wait time, completed trips, support contacts, and marketplace guardrails. A strong close would be: “If I had more time, I’d add continuous updating during the trip, conformal intervals by city/time segment, and a randomized experiment comparing rider trust and marketplace outcomes, not just prediction error.”
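
The offline metrics named above can be sketched as a per-segment report. This is an illustrative helper rather than a standard API: the function name, the `(segment, predicted, actual)` row layout, and the sample numbers are all assumptions.

```python
from collections import defaultdict
from statistics import mean


def eta_error_report(rows, tolerance_minutes=1.0):
    """Per-segment ETA error summary.

    rows: iterable of (segment, predicted_minutes, actual_minutes).
    Signed error is predicted minus actual, so negative means underestimate.
    """
    errors_by_segment = defaultdict(list)
    for segment, predicted, actual in rows:
        errors_by_segment[segment].append(predicted - actual)

    report = {}
    for segment, errors in errors_by_segment.items():
        report[segment] = {
            "mean_signed_error": mean(errors),
            "mae": mean(abs(e) for e in errors),
            "share_within_tolerance": mean(
                1.0 if abs(e) <= tolerance_minutes else 0.0 for e in errors
            ),
            "underprediction_rate": mean(1.0 if e < 0 else 0.0 for e in errors),
        }
    return report


# Hypothetical evaluation rows: (segment, predicted minutes, actual minutes).
rows = [
    ("airport", 5.0, 8.0),
    ("airport", 6.0, 6.5),
    ("airport", 7.0, 6.0),
    ("downtown", 4.0, 4.5),
    ("downtown", 5.0, 4.0),
    ("downtown", 3.0, 3.0),
]

report = eta_error_report(rows)
for segment, metrics in report.items():
    print(segment, {name: round(value, 2) for name, value in metrics.items()})
```

In this toy data the airport segment has both a higher MAE and a negative mean signed error, which is the pattern a global MAE would hide.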

A second angle

For **Select the better $5 promo-targeting model**, the same real-time decision discipline applies, but the target is an intervention rather than a passive prediction. A tempting model predicts who is most likely to redeem a coupon, but the better framing is **uplift modeling**: who is likely to take an incremental trip because of the $5 offer. Validation must account for treatment assignment, budget constraints, delayed redemption, and possible leakage from post-offer behavior. Instead of ranking by conversion probability alone, rank by expected incremental profit: incremental conversion probability times margin minus coupon cost. The model may have slightly worse AUC for redemption but be better if it avoids giving discounts to users who would have converted anyway.
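
A short sketch shows how ranking by expected incremental profit can disagree with ranking by redemption probability. It applies the rule stated above (incremental conversion probability times margin minus coupon cost). The `Candidate` fields, the user labels, the $20 margin, and the probabilities are hypothetical values chosen to illustrate the contrast, and the two conversion probabilities are assumed to come from an uplift model or a randomized holdout.

```python
from dataclasses import dataclass

COUPON_COST = 5.0  # the $5 promo from the question


@dataclass
class Candidate:
    user_id: str
    p_convert_with_promo: float     # assumed output of the treated-arm model
    p_convert_without_promo: float  # assumed output of the control-arm model
    margin: float                   # hypothetical margin per incremental trip


def expected_incremental_profit(c: Candidate, coupon_cost: float = COUPON_COST) -> float:
    """Incremental conversion probability times margin, minus coupon cost."""
    uplift = c.p_convert_with_promo - c.p_convert_without_promo
    return uplift * c.margin - coupon_cost


candidates = [
    Candidate("loyal", 0.90, 0.85, 20.0),   # converts anyway, little uplift
    Candidate("lapsed", 0.40, 0.05, 20.0),  # large uplift
    Candidate("new", 0.30, 0.10, 20.0),     # moderate uplift
]

# Ranking by likelihood to redeem puts the loyal user first.
by_redemption = sorted(candidates, key=lambda c: c.p_convert_with_promo, reverse=True)
# Ranking by expected incremental profit puts the lapsed user first.
by_profit = sorted(candidates, key=expected_incremental_profit, reverse=True)
# Only target users whose expected incremental profit is positive.
targeted = [c.user_id for c in by_profit if expected_incremental_profit(c) > 0]

print("by redemption:", [c.user_id for c in by_redemption])
print("by profit:", [c.user_id for c in by_profit])
print("targeted:", targeted)
```

This follows the simplified formula in the text, which charges the coupon cost for every targeted user; a refinement would charge it only when the coupon is redeemed.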

Common pitfalls

Pitfall: Optimizing the wrong metric.

A common analytical mistake is saying “I would choose the model with the lowest RMSE” or “highest AUC” without connecting it to the decision. For ETA, signed bias and interval calibration may matter more than small RMSE gains; for promo targeting, incremental profit matters more than raw conversion prediction. A better answer names the offline metric, the business metric, and the guardrails.

Pitfall: Treating real-time ML design as infrastructure design.

Some candidates drift into Kafka, feature store replication, service latency, or retry mechanics. For a Data Scientist interview, keep those as assumptions and focus on signal quality, label definition, validation, model behavior, calibration, and experiment design. You can say “assuming these features are available at decision time,” then analyze whether they are predictive, leaky, stable, and fair across segments.

Pitfall: Ignoring leakage and delayed labels.

Wrong-but-tempting answers use features such as final route duration, post-click engagement, redemption status, or completed-trip attributes that would not exist at prediction time. The stronger move is to explicitly separate pre-decision features from post-outcome data, use time-based splits, and define label maturity windows. This shows production judgment without needing to design the data pipeline.
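
One way to express that separation is a time-ordered split that drops immature labels, plus a check that every feature was recorded before the decision. The sketch below is illustrative only: the function names, the `decision_time` field, the two-day label window, and the feature names are assumptions, not an existing schema.

```python
from datetime import datetime, timedelta


def split_by_time(rows, train_end, label_window, as_of):
    """Time-ordered split that keeps only rows whose label has matured.

    A row is mature when its attribution window has closed by `as_of`.
    Rows decided before `train_end` go to train, the rest to test.
    """
    train, test = [], []
    for row in rows:
        if row["decision_time"] + label_window > as_of:
            continue  # label could still change, so exclude it from evaluation
        (train if row["decision_time"] < train_end else test).append(row)
    return train, test


def leaky_features(feature_times, decision_time):
    """Names of features whose values were recorded after the decision."""
    return sorted(name for name, ts in feature_times.items() if ts > decision_time)


# Hypothetical decisions on four days, evaluated as of January 10.
rows = [{"decision_time": datetime(2024, 1, day)} for day in (2, 4, 6, 9)]
train, test = split_by_time(
    rows,
    train_end=datetime(2024, 1, 5),
    label_window=timedelta(days=2),
    as_of=datetime(2024, 1, 10),
)

# Hypothetical feature timestamps for a decision made at 12:00 on January 6.
leaks = leaky_features(
    {
        "recent_traffic_speed": datetime(2024, 1, 6, 11, 55),
        "final_route_duration": datetime(2024, 1, 6, 12, 30),
    },
    decision_time=datetime(2024, 1, 6, 12, 0),
)
print(len(train), len(test), leaks)
```

The January 9 row is excluded from both sets because its two-day window has not closed by the evaluation date, and `final_route_duration` is flagged because it only exists after the trip ends.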

Connections

Interviewers may pivot from here into experimentation, especially A/B test design, guardrail metrics, heterogeneous treatment effects, and launch decisions. They may also probe causal inference, ranking evaluation, calibration, anomaly diagnosis, or model monitoring by cohort. For Uber-specific contexts, expect follow-ups on marketplace metrics such as cancellations, completed trips, rider wait time, driver utilization, and promo budget efficiency.
