ETA Evaluation And Prediction

What's being tested

Uber ETA questions test whether a Data Scientist can evaluate prediction quality, design marketplace-safe experiments, and connect model changes to rider, driver, and business outcomes. Interviewers are probing for more than “lower MAE is good”: they want to see if you understand calibration, uncertainty, conversion impact, cancellation behavior, and two-sided marketplace interference. Uber cares because ETA errors directly affect rider trust, dispatch efficiency, airport throughput, driver utilization, and marketplace liquidity. Strong answers separate prediction evaluation, causal impact, and operational diagnosis instead of mixing them into one vague “ETA improved” story.

Core knowledge

ETA labels must be defined precisely: request-to-pickup ETA, pickup-to-dropoff ETA, or total trip duration. The target should match the user-facing promise, e.g. actual wait time, $t_\text{pickup} - t_\text{request}$, with careful treatment of cancellations, reassignment, batching, and no-shows.
Point prediction metrics capture different failure modes. MAE is interpretable in minutes, RMSE penalizes large misses, median absolute error is robust to airport/event outliers, and bias is $\text{Bias} = \frac{1}{n}\sum_i(\hat{y}_i-y_i)$, where $\hat{y}_i$ is the predicted and $y_i$ the realized value, so negative bias means ETAs run short on average. Always segment by city, time of day, weather, airport, and trip type.
Calibration matters as much as accuracy when ETAs are shown to users. If trips predicted as 5 minutes actually average 7 minutes, the model is underestimating and may inflate conversion while increasing cancellations. Reliability curves by ETA bucket are often more useful than a single aggregate score.
Uncertainty intervals are useful for dispatch and UX decisions. A 90% prediction interval should contain the realized arrival time about 90% of the time; evaluate with coverage, interval width, and pinball loss for quantiles. Quantile regression, conformal prediction, and calibrated residual models are common DS-level tools.
Business metrics should separate marketplace outcomes from model-quality outcomes. Common primary metrics include request conversion, cancellation rate, completed trips, pickup delay, and rider ETA satisfaction; guardrails include driver idle time, acceptance rate, surge exposure, gross bookings, and support contacts.
Interference violates standard A/B assumptions because one rider’s ETA treatment can affect nearby drivers and other riders. If treatment changes dispatch or demand, user-level randomization can contaminate control. Consider geo-cluster randomization, switchback experiments, or city/time-cell designs to reduce spillovers.
Switchback experiments randomize treatment by market and time block, such as city-zone-hour cells. They are well suited to marketplace systems where supply is shared. Analyze using cluster-robust standard errors or regression with time and geography fixed effects, not naïve row-level standard errors.
Intent-to-treat and treatment-on-treated answer different questions. ITT estimates the effect of assignment: $E[Y|Z=1]-E[Y|Z=0]$, where $Z$ is the assigned arm and $Y$ is the outcome. If only some users actually see the ETA variant, TOT may require compliance adjustment or instrumental variables, with assignment as the instrument.
Power analysis must account for clustering and serial correlation. Cluster designs inflate variance by the design effect $DEFF = 1 + (m-1)\rho$, where $m$ is the number of observations per cluster (assumed roughly equal across clusters) and $\rho$ is intra-cluster correlation. For switchbacks, more independent time blocks often matter more than more events within each block.
Offline model evaluation should use temporal holdouts and realistic slices. Random row splits can leak traffic patterns because adjacent trips share road conditions, weather, and demand shocks. Prefer train on past weeks, validate on future weeks, then stress-test holidays, airports, concerts, rain, and low-supply periods.
Causal analysis of ETA display changes must handle selection. Riders who see long ETAs may abandon before trip creation, so completed-trip-only analysis is biased. Define the funnel from app open or request screen exposure through conversion, cancellation, pickup, and completion.
Time-series analytics often uses rolling baselines to distinguish product effects from seasonality. For example, compare 7-day median ETA by market and hour, but avoid mixing local time zones. Use DATE_TRUNC, explicit timezone conversion, and percentile functions such as percentile_cont carefully.
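Several of the formulas in this list fit in a few lines of code. The following is a minimal sketch using only the Python standard library; the function names and the input layout (parallel lists of minutes) are assumptions made for illustration, not part of any Uber tooling.

```python
import math
import statistics


def point_metrics(predicted, actual):
    """Return MAE, RMSE, median absolute error, and bias (predicted - actual)."""
    errors = [p - a for p, a in zip(predicted, actual)]
    abs_errors = [abs(e) for e in errors]
    return {
        "mae": statistics.fmean(abs_errors),
        "rmse": math.sqrt(statistics.fmean([e * e for e in errors])),
        "median_ae": statistics.median(abs_errors),
        "bias": statistics.fmean(errors),  # negative => ETAs too short on average
    }


def pinball_loss(quantile_pred, actual, q):
    """Average pinball (quantile) loss for predictions of the q-th quantile."""
    losses = []
    for pred, y in zip(quantile_pred, actual):
        diff = y - pred
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        losses.append(q * diff if diff >= 0 else (q - 1) * diff)
    return statistics.fmean(losses)


def interval_coverage(lower, upper, actual):
    """Share of realized values that fall inside [lower, upper]."""
    hits = [lo <= y <= hi for lo, hi, y in zip(lower, upper, actual)]
    return sum(hits) / len(hits)


def design_effect(cluster_size, icc):
    """DEFF = 1 + (m - 1) * rho, assuming equal-sized clusters."""
    return 1 + (cluster_size - 1) * icc


if __name__ == "__main__":
    predicted = [5, 6, 4, 10]  # illustrative values, in minutes
    actual = [7, 6, 5, 9]
    print(point_metrics(predicted, actual))
    print(design_effect(cluster_size=50, icc=0.1))
```

In practice each function would be applied per segment (city, hour, airport versus non-airport) rather than once over all trips, and the coverage of a nominal 90% interval would be compared against 0.90 alongside its average width.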

Worked example

For "Design an ETA experiment under interference," start by clarifying the treatment: is Uber changing the displayed ETA, the prediction model behind it, or the dispatch policy using it? Then define the unit of randomization, because rider-level randomization is likely invalid if treated riders change driver allocation, pickup congestion, or demand in the same neighborhood. A strong answer would organize around four pillars: estimand, randomization design, metrics, and inference. For the estimand, state the target explicitly, e.g. the effect of assigning a market-time cell to the new ETA system on conversion, cancellation, and realized wait time. For design, propose a switchback or geo-time cluster experiment, e.g. randomize zone-hour blocks within matched markets while avoiding tiny cells with unstable supply. For metrics, choose a primary metric such as completed request conversion, plus guardrails like pickup lateness, driver utilization, cancellation, and ETA calibration. For inference, use cluster-robust standard errors or a regression with geography and time fixed effects, and pre-register the analysis window and exclusion rules. A key tradeoff is that larger clusters reduce interference but leave fewer independent units and therefore lower statistical power, so you would justify cluster size using observed driver movement and historical intra-cluster correlation. Close by saying that, with more time, you would run heterogeneity analysis for airports, peak commute, low-supply periods, and new riders, because the average effect can hide trust-damaging underestimation in critical segments.
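The design and inference steps can be sketched roughly as follows. This is an illustration under stated assumptions, not a production analysis: the function names and the `(zone, hour, converted)` event layout are hypothetical, treatment is balanced within each zone, and the inference step aggregates outcomes to one value per zone-hour cell as a simple stand-in for cluster-robust standard errors. It ignores carryover between adjacent time blocks and any fixed effects.

```python
import math
import random
import statistics


def assign_switchback(zones, hours, seed=0):
    """Assign half of each zone's hour blocks to treatment (1), the rest to control (0)."""
    rng = random.Random(seed)
    assignment = {}
    for zone in zones:
        shuffled = list(hours)
        rng.shuffle(shuffled)
        treated = set(shuffled[: len(shuffled) // 2])
        for hour in hours:
            assignment[(zone, hour)] = 1 if hour in treated else 0
    return assignment


def cell_level_effect(events, assignment):
    """Treatment-minus-control difference in mean cell-level conversion.

    events: iterable of (zone, hour, converted) with converted in {0, 1}.
    Each cell contributes one mean, so the unit of analysis matches the
    unit of randomization. Needs at least two cells per arm.
    """
    by_cell = {}
    for zone, hour, converted in events:
        by_cell.setdefault((zone, hour), []).append(converted)

    arms = {0: [], 1: []}
    for cell, outcomes in by_cell.items():
        arms[assignment[cell]].append(statistics.fmean(outcomes))

    diff = statistics.fmean(arms[1]) - statistics.fmean(arms[0])
    # Unequal-variance standard error computed across cells, not across rows.
    se = math.sqrt(
        statistics.variance(arms[1]) / len(arms[1])
        + statistics.variance(arms[0]) / len(arms[0])
    )
    return diff, se, len(arms[0]), len(arms[1])
```

The point of the cell-level aggregation is that the effective sample size is the number of zone-hour cells, not the number of rider sessions, which is why adding more independent blocks tends to help more than adding traffic within a block.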

A second angle

For "Compute ETA shift and conversion uplift," the same concept becomes an analytics execution problem rather than an experiment-design problem. You still need clear definitions: ETA shift could mean change in predicted ETA, actual wait time, or prediction error, and conversion uplift should be measured from the same exposure population, not only completed trips. A good answer would build a date spine, compute timezone-aware daily or hourly metrics, and compare treatment versus control using conversion-rate differences or a regression adjustment. The interviewer may care about whether you use rolling 7-day medians to reduce noise, but the DS judgment is in avoiding biased denominators and mixing of segments. The transferable principle is that ETA analysis always needs both prediction-quality metrics and behavioral outcome metrics.
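A minimal sketch of that execution, assuming hypothetical exposure rows with the keys `ts_utc`, `utc_offset_hours`, `arm`, and `requested`. For simplicity each market is given a fixed UTC offset, which ignores daylight saving; real pipelines would typically convert with named IANA time zones instead.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
import statistics


def local_date(ts_utc, utc_offset_hours):
    """Convert a timezone-aware UTC timestamp to the market's local calendar date."""
    return ts_utc.astimezone(timezone(timedelta(hours=utc_offset_hours))).date()


def daily_conversion(exposures):
    """Conversion per (local date, arm), with every exposure in the denominator."""
    counts = defaultdict(lambda: [0, 0])  # (date, arm) -> [requests, exposures]
    for row in exposures:
        key = (local_date(row["ts_utc"], row["utc_offset_hours"]), row["arm"])
        counts[key][0] += int(row["requested"])
        counts[key][1] += 1
    return {key: req / n for key, (req, n) in counts.items()}


def uplift(conversion, day):
    """Treatment-minus-control conversion difference for one local date."""
    return conversion[(day, "treatment")] - conversion[(day, "control")]


def rolling_median(daily_values, window=7):
    """Trailing median over the last `window` calendar days, keyed by date.

    daily_values: dict mapping date -> value. Days missing from the dict are
    skipped rather than filled, so a date spine should be built upstream if
    gaps need to be made explicit.
    """
    out = {}
    for day in sorted(daily_values):
        recent = [
            daily_values[d]
            for d in (day - timedelta(days=k) for k in range(window))
            if d in daily_values
        ]
        out[day] = statistics.median(recent)
    return out
```

Counting exposures rather than completed trips in the denominator is the design choice that matters here: riders who abandon after seeing a long ETA stay in the population being compared.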

Common pitfalls

Pitfall: Treating lower average ETA as automatically good.

A tempting answer is “if predicted ETA decreases and conversion increases, the model improved.” That may simply mean the product is overpromising, showing ETAs shorter than the waits riders actually experience, which increases clicks but causes later cancellations and lower trust. A stronger answer checks actual wait time, prediction bias, calibration by ETA bucket, cancellation after request, and support complaints.
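The calibration-by-bucket check can be sketched as a small reliability table. The function name, bucket edges, and input layout below are assumptions chosen for illustration; values are in minutes.

```python
import statistics


def reliability_table(predicted, actual, edges=(0, 3, 6, 10, 20)):
    """Mean predicted vs. mean actual wait per predicted-ETA bucket.

    Buckets are half-open intervals [edges[i], edges[i + 1]); predictions
    outside the overall range are ignored and empty buckets are skipped.
    """
    rows = []
    for lo, hi in zip(edges, edges[1:]):
        pairs = [(p, a) for p, a in zip(predicted, actual) if lo <= p < hi]
        if not pairs:
            continue
        mean_pred = statistics.fmean([p for p, _ in pairs])
        mean_actual = statistics.fmean([a for _, a in pairs])
        rows.append({
            "bucket": (lo, hi),
            "n": len(pairs),
            "mean_predicted": mean_pred,
            "mean_actual": mean_actual,
            # Positive gap => riders waited longer than the ETA they were shown.
            "gap": mean_actual - mean_pred,
        })
    return rows


if __name__ == "__main__":
    for row in reliability_table([4, 5, 5, 8], [6, 7, 8, 8]):
        print(row)
```

A persistent positive gap in the short-ETA buckets, paired with rising conversion and rising post-request cancellation, is the pattern that separates an overpromising model from a genuinely better one.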

Pitfall: Ignoring marketplace interference.

Naïve user-level A/B testing assumes one user’s treatment does not affect another user’s outcome. In ride-hailing, ETA changes can shift demand, driver assignment, airport queue behavior, and surge exposure. Say explicitly why SUTVA may fail and propose cluster, switchback, or matched-market designs with appropriate inference.

Pitfall: Over-indexing on model architecture.

For a Data Scientist interview, do not spend most of the answer on deep learning layers, map-matching internals, or serving infrastructure. It is better to explain labels, metrics, offline-vs-online gaps, calibration, experiment design, and causal interpretation. Mention model families like gradient-boosted trees (e.g. XGBoost) or quantile models only to support evaluation tradeoffs.

Connections

Interviewers can pivot from ETA evaluation into marketplace experimentation, causal inference under interference, forecast calibration, ranking/model evaluation, or funnel analytics. They may also ask for SQL/Python execution details, especially rolling medians, cohort conversion, timezone-aware aggregation, and treatment-control comparisons.
