ETA Evaluation And Prediction
Asked of: Data Scientist

What's being tested
Uber ETA questions test whether a Data Scientist can evaluate prediction quality, design marketplace-safe experiments, and connect model changes to rider, driver, and business outcomes. Interviewers are probing for more than “lower MAE is good”: they want to see if you understand calibration, uncertainty, conversion impact, cancellation behavior, and two-sided marketplace interference. Uber cares because ETA errors directly affect rider trust, dispatch efficiency, airport throughput, driver utilization, and marketplace liquidity. Strong answers separate prediction evaluation, causal impact, and operational diagnosis instead of mixing them into one vague “ETA improved” story.
Core knowledge
- ETA labels must be defined precisely: request-to-pickup ETA, pickup-to-dropoff ETA, or total trip duration. The target should match the user-facing promise, e.g. actual wait time, with careful treatment of cancellations, reassignment, batching, and no-shows.
- Point prediction metrics capture different failure modes. MAE is interpretable in minutes, RMSE penalizes large misses, median absolute error is robust to airport/event outliers, and bias (the mean signed error) shows whether the model systematically over- or underestimates. Always segment by city, time of day, weather, airport, and trip type.
- Calibration matters as much as accuracy when ETAs are shown to users. If trips predicted as 5 minutes actually average 7 minutes, the model is underestimating and may inflate conversion while increasing cancellations. Reliability curves by ETA bucket are often more useful than a single aggregate score.
- Uncertainty intervals are useful for dispatch and UX decisions. A 90% prediction interval should contain the realized arrival time about 90% of the time; evaluate with coverage, interval width, and pinball loss for quantiles. Quantile regression, conformal prediction, and calibrated residual models are common DS-level tools.
- Business metrics should separate marketplace outcomes from model-quality outcomes. Common primary metrics include request conversion, cancellation rate, completed trips, pickup delay, and rider ETA satisfaction; guardrails include driver idle time, acceptance rate, surge exposure, gross bookings, and support contacts.
- Interference violates standard A/B assumptions because one rider’s ETA treatment can affect nearby drivers and other riders. If treatment changes dispatch or demand, user-level randomization can contaminate control. Consider geo-cluster randomization, switchback experiments, or city/time-cell designs to reduce spillovers.
- Switchback experiments randomize treatment by market and time block, such as city-zone-hour cells. They are well suited to marketplace systems where supply is shared. Analyze using cluster-robust standard errors or regression with time and geography fixed effects, not naïve row-level standard errors.
- Intent-to-treat and treatment-on-treated answer different questions. ITT estimates the effect of assignment: $\mathrm{ITT} = E[Y \mid Z = 1] - E[Y \mid Z = 0]$, where $Z$ is the assigned arm and $Y$ is the outcome. If only some users actually see the ETA variant, TOT may require compliance adjustment or instrumental variables, with assignment as the instrument.
- Power analysis must account for clustering and serial correlation. Cluster designs inflate variance by the design effect $1 + (m - 1)\rho$, where $m$ is cluster size and $\rho$ is intra-cluster correlation. For switchbacks, more independent time blocks often matter more than more events within each block.
- Offline model evaluation should use temporal holdouts and realistic slices. Random row splits can leak traffic patterns because adjacent trips share road conditions, weather, and demand shocks. Prefer train on past weeks, validate on future weeks, then stress-test holidays, airports, concerts, rain, and low-supply periods.
- Causal analysis of ETA display changes must handle selection. Riders who see long ETAs may abandon before trip creation, so completed-trip-only analysis is biased. Define the funnel from app open or request screen exposure through conversion, cancellation, pickup, and completion.
- Time-series analytics often uses rolling baselines to distinguish product effects from seasonality. For example, compare 7-day median ETA by market and hour, but avoid mixing local time zones. Use DATE_TRUNC, explicit timezone conversion, and percentile functions such as percentile_cont carefully.
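The point-metric and uncertainty bullets above can be made concrete with a small sketch. The function names, the toy numbers, and the sign convention for bias (predicted minus actual, so negative means underestimation) are assumptions chosen for illustration, not an established Uber convention.

```python
import math
import statistics


def point_metrics(predicted, actual):
    """Summarize point-prediction error in the units of the inputs (minutes here)."""
    # Signed error is predicted minus actual (an assumed convention).
    errors = [p - a for p, a in zip(predicted, actual)]
    abs_errors = [abs(e) for e in errors]
    return {
        "mae": statistics.fmean(abs_errors),
        "rmse": math.sqrt(statistics.fmean([e * e for e in errors])),
        "median_ae": statistics.median(abs_errors),
        "bias": statistics.fmean(errors),  # negative: underestimates on average
    }


def interval_coverage(lower, upper, actual):
    """Share of realized values that fall inside [lower, upper]."""
    hits = sum(lo <= a <= hi for lo, a, hi in zip(lower, actual, upper))
    return hits / len(actual)


def mean_interval_width(lower, upper):
    """Average interval width; narrower is better at equal coverage."""
    return statistics.fmean([hi - lo for lo, hi in zip(lower, upper)])


def pinball_loss(quantile_pred, actual, q):
    """Average pinball (quantile) loss for predictions of the q-th quantile."""
    losses = []
    for pred, a in zip(quantile_pred, actual):
        diff = a - pred
        losses.append(q * diff if diff >= 0 else (q - 1) * diff)
    return statistics.fmean(losses)


if __name__ == "__main__":
    # Hypothetical toy data, in minutes.
    predicted = [5.0, 6.0, 4.0, 10.0]
    actual = [7.0, 6.0, 5.0, 8.0]
    lower = [3.0, 4.0, 2.0, 9.0]
    upper = [8.0, 8.0, 6.0, 12.0]
    print(point_metrics(predicted, actual))
    print(interval_coverage(lower, upper, actual))
    print(mean_interval_width(lower, upper))
    print(pinball_loss(upper, actual, q=0.9))
```

In practice each metric would be computed per segment (city, hour, airport, trip type) rather than once over all trips, since an aggregate can hide a badly calibrated slice.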
Worked example
For “Design an ETA experiment under interference,” start by clarifying the treatment: is Uber changing the displayed ETA, the prediction model behind it, or the dispatch policy using it? Then define the unit of randomization, because rider-level randomization is likely invalid if treated riders change driver allocation, pickup congestion, or demand in the same neighborhood. A strong answer would organize around four pillars: estimand, randomization design, metrics, and inference. For the estimand, state the target explicitly, for example the effect of assigning a market-time cell to the new ETA system on conversion, cancellation, and realized wait time. For design, propose a switchback or geo-time cluster experiment, e.g. randomize zone-hour blocks within matched markets while avoiding tiny cells with unstable supply. For metrics, choose a primary metric such as completed request conversion, plus guardrails like pickup lateness, driver utilization, cancellation, and ETA calibration. For inference, use cluster-robust standard errors or a regression with geography and time fixed effects, and pre-register the analysis window and exclusion rules. A key tradeoff is that larger clusters reduce interference but lower statistical power, so you would justify cluster size using observed driver movement and historical intra-cluster correlation. Close by saying that, with more time, you would run heterogeneity analysis for airports, peak commute, low-supply periods, and new riders, because the average effect can hide trust-damaging underestimation in critical segments.
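The cluster-size tradeoff can be illustrated with the standard design-effect formula. This is a minimal sketch that assumes equal-sized clusters; the request count, candidate cluster sizes, and intra-cluster correlation below are hypothetical placeholders, not observed values.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from clustering: 1 + (m - 1) * rho, equal cluster sizes assumed."""
    return 1.0 + (cluster_size - 1) * icc


def effective_sample_size(n_total, cluster_size, icc):
    """Number of independent observations the clustered sample is worth."""
    return n_total / design_effect(cluster_size, icc)


if __name__ == "__main__":
    n_total = 100_000  # hypothetical number of requests in the experiment
    icc = 0.02         # hypothetical intra-cluster correlation
    for cluster_size in (50, 500, 5000):
        deff = design_effect(cluster_size, icc)
        n_eff = effective_sample_size(n_total, cluster_size, icc)
        print(f"m={cluster_size:>5}  design effect={deff:7.2f}  effective n={n_eff:9.0f}")
```

With a fixed number of requests, the effective sample size shrinks quickly as cells grow, which is why the cluster size needs to be justified rather than simply maximized to suppress spillovers.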
A second angle
For “Compute ETA shift and conversion uplift,” the same concept becomes an analytics execution problem rather than an experiment-design problem. You still need clear definitions: ETA shift could mean change in predicted ETA, actual wait time, or prediction error, and conversion uplift should be measured from the same exposure population, not only completed trips. A good answer would build a date spine, compute timezone-aware daily or hourly metrics, and compare treatment versus control using conversion-rate differences or a regression adjustment. The interviewer may care about whether you use rolling 7-day medians to reduce noise, but the DS judgment is in avoiding biased denominators and segment-mixing. The transferable principle is that ETA analysis always needs both prediction-quality metrics and behavioral outcome metrics.
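A rough Python sketch of those steps follows. The row schema (`arm`, `completed`), the helper names, and the fixed UTC offset are all assumptions made for illustration; a fixed offset ignores daylight saving time, so a real pipeline would convert with named time zones instead.

```python
import statistics
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def local_date(ts_utc, utc_offset_hours):
    """Bucket a UTC timestamp into the market's local calendar day (fixed offset, no DST)."""
    return ts_utc.astimezone(timezone(timedelta(hours=utc_offset_hours))).date()


def conversion_by_arm(exposures):
    """Conversion per arm with exposures as the denominator.

    Each row is one rider exposure to the ETA screen, including riders who
    abandoned, so the denominator is not restricted to completed trips.
    """
    counts = defaultdict(lambda: [0, 0])  # arm -> [exposures, completions]
    for row in exposures:
        counts[row["arm"]][0] += 1
        counts[row["arm"]][1] += int(row["completed"])
    return {arm: done / shown for arm, (shown, done) in counts.items()}


def rolling_median(values, window=7):
    """Trailing median over up to `window` most recent values."""
    return [
        statistics.median(values[max(0, i - window + 1): i + 1])
        for i in range(len(values))
    ]


if __name__ == "__main__":
    # Hypothetical exposure rows.
    exposures = (
        [{"arm": "control", "completed": c} for c in (True, False, True, False)]
        + [{"arm": "treatment", "completed": c} for c in (True, True, True, False)]
    )
    rates = conversion_by_arm(exposures)
    print("uplift:", rates["treatment"] - rates["control"])

    # A late-evening request in a UTC-8 market belongs to the previous local day.
    print(local_date(datetime(2024, 3, 2, 3, 0, tzinfo=timezone.utc), -8))

    # Hypothetical daily median ETAs (minutes) with one event-day outlier.
    print(rolling_median([6, 8, 7, 30, 7, 6, 8, 9], window=7))
```

The rolling median barely moves on the outlier day, which is the noise-reduction property the question is after; the uplift itself still needs an uncertainty estimate before it is reported.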
Common pitfalls
Pitfall: Treating lower average ETA as automatically good.
A tempting answer is “if predicted ETA decreases and conversion increases, the model improved.” That may simply mean the product is understating wait times to increase clicks, causing later cancellations and lower trust. A stronger answer checks actual wait time, prediction bias, calibration by ETA bucket, cancellation after request, and support complaints.
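One way to run the calibration-by-bucket check is a small reliability table. This is a sketch under assumptions: the bucket width, the function name, and the toy data are illustrative, and the gap is defined as mean actual minus mean predicted, so a positive gap means riders waited longer than promised.

```python
import statistics
from collections import defaultdict


def reliability_table(predicted, actual, bucket_width=2.0):
    """Compare mean predicted and mean actual wait within each predicted-ETA bucket."""
    buckets = defaultdict(list)
    for p, a in zip(predicted, actual):
        start = int(p // bucket_width) * bucket_width  # lower edge of the bucket
        buckets[start].append((p, a))
    rows = []
    for start in sorted(buckets):
        preds = [p for p, _ in buckets[start]]
        acts = [a for _, a in buckets[start]]
        mean_pred = statistics.fmean(preds)
        mean_act = statistics.fmean(acts)
        rows.append({
            "bucket_start": start,
            "n": len(preds),
            "mean_predicted": mean_pred,
            "mean_actual": mean_act,
            "gap": mean_act - mean_pred,  # positive: underestimated wait
        })
    return rows


if __name__ == "__main__":
    # Hypothetical trips, in minutes: short ETAs run about two minutes late.
    predicted = [5.0, 5.0, 9.0, 9.0]
    actual = [7.0, 7.0, 9.0, 10.0]
    for row in reliability_table(predicted, actual):
        print(row)
```

A table like this, split by treatment arm, makes it visible when a conversion gain coincides with a growing positive gap in the short-ETA buckets.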
Pitfall: Ignoring marketplace interference.
Naïve user-level A/B testing assumes one user’s treatment does not affect another user’s outcome. In ride-hailing, ETA changes can shift demand, driver assignment, airport queue behavior, and surge exposure. Say explicitly why SUTVA may fail and propose cluster, switchback, or matched-market designs with appropriate inference.
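As a rough illustration of a switchback design with cluster-level inference, the sketch below assigns whole zone-time blocks to an arm and then compares block means instead of individual rows. The hashing scheme, salt, and row schema are hypothetical, and the analysis treats blocks as independent, which ignores carryover between adjacent blocks and serial correlation.

```python
import hashlib
import math
import statistics
from collections import defaultdict


def assign_arm(zone, block_start, salt="eta-exp-1"):
    """Deterministically assign a whole zone-time block to one arm (hypothetical scheme)."""
    key = f"{salt}|{zone}|{block_start}".encode()
    return "treatment" if hashlib.sha256(key).digest()[0] % 2 == 0 else "control"


def cluster_level_effect(rows):
    """Difference in mean block-level conversion, with a block-level standard error.

    Each block counts once regardless of how many requests it contains, so the
    standard error reflects the number of blocks rather than the number of rows.
    """
    by_block = defaultdict(list)
    for r in rows:
        by_block[(r["zone"], r["block_start"], r["arm"])].append(r["converted"])
    block_means = defaultdict(list)
    for (_, _, arm), outcomes in by_block.items():
        block_means[arm].append(statistics.fmean(outcomes))
    t, c = block_means["treatment"], block_means["control"]
    effect = statistics.fmean(t) - statistics.fmean(c)
    se = math.sqrt(statistics.variance(t) / len(t) + statistics.variance(c) / len(c))
    return effect, se


if __name__ == "__main__":
    # Show which arm each hypothetical zone-hour block would receive.
    for zone in ("zone_a", "zone_b"):
        for hour in ("2024-01-01T08", "2024-01-01T09"):
            print(zone, hour, assign_arm(zone, hour))

    # Hypothetical request-level outcomes, already labeled with their block's arm.
    rows = []
    blocks = [
        ("zone_a", "h08", "treatment", [1, 1, 0, 1]),
        ("zone_b", "h09", "treatment", [1, 0]),
        ("zone_a", "h09", "control", [0, 1]),
        ("zone_b", "h08", "control", [0, 0, 1, 0]),
    ]
    for zone, block, arm, outcomes in blocks:
        rows += [{"zone": zone, "block_start": block, "arm": arm, "converted": y}
                 for y in outcomes]
    print(cluster_level_effect(rows))
```

A regression with zone and time fixed effects and cluster-robust errors would be the fuller version of the same idea; the block-mean comparison is only the simplest defensible baseline.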
Pitfall: Over-indexing on model architecture.
For a Data Scientist interview, do not spend most of the answer on deep learning layers, map-matching internals, or serving infrastructure. It is better to explain labels, metrics, offline-vs-online gaps, calibration, experiment design, and causal interpretation. Mention model families like gradient-boosted trees (e.g. XGBoost) or quantile models only to support evaluation tradeoffs.
Connections
Interviewers can pivot from ETA evaluation into marketplace experimentation, causal inference under interference, forecast calibration, ranking/model evaluation, or funnel analytics. They may also ask for SQL/Python execution details, especially rolling medians, cohort conversion, timezone-aware aggregation, and treatment-control comparisons.
Further reading
- Trustworthy Online Controlled Experiments (Kohavi, Tang, Xu): Practical reference for experiment design, metrics, guardrails, and inference.
- Causal Inference: What If (Hernán and Robins): Clear grounding for estimands, bias, and causal assumptions.
- Conformal Prediction for Reliable Machine Learning (Balasubramanian, Ho, Vovk): Useful background for uncertainty intervals and coverage guarantees.
Practice questions
- Evaluate ETA Impact on Conversion (Uber · Data Scientist · Technical Screen · medium)
- How to evaluate lowering ETA? (Uber · Data Scientist · Technical Screen · medium)
- Design a Ride-Hailing ETA System (Uber · Data Scientist · Technical Screen · medium)
- Design ETA prediction for Uber rides (Uber · Data Scientist · Technical Screen · hard)
- Design airport dispatch with ETA uncertainty (Uber · Data Scientist · Technical Screen · medium)
- Estimate price–ETA trade-offs causally (Uber · Data Scientist · Onsite · hard)
- Compute ETA shift and conversion uplift (Uber · Data Scientist · Technical Screen · medium)
- Design an ETA experiment under interference (Uber · Data Scientist · Technical Screen · hard)
- Improve Estimated Time of Arrival for Uber Riders (Uber · Data Scientist · Technical Screen · hard)
- Evaluate New Model's Impact on Rider and Driver Experience (Uber · Data Scientist · Technical Screen · hard)
- Measure Impact of Updated Rider ETA Algorithm (Uber · Data Scientist · Technical Screen · hard)
Related concepts
- Product Metrics And Marketplace Diagnostics (Analytics & Experimentation)
- Predictive Modeling For Delivery And Marketplace Decisions (Machine Learning)
- Survival Analysis And Time-To-Event Modeling (Statistics & Math)
- Switchback Experiments And Marketplace Interference (Analytics & Experimentation)
- Real-Time Distributed Geospatial And Event Systems (System Design)
- Delivery Driver Payment And Cost Systems (System Design)