Experiment Design: Pickup ETA Card Redesign
Context: After a rider requests a trip, the app shows a pickup ETA card. The hypothesis is that clearer ETA presentation reduces cancellations and increases completed trips. Design an end-to-end A/B test to evaluate this.
Provide the following:
- Experimental unit and randomization
  - Choose one: rider-level, request-level, or geo/time cluster.
  - Justify your choice and discuss interference risks (e.g., supply-side coupling, surge dynamics, driver behavior) and how you would mitigate them.
- Target population and eligibility
  - Precisely define inclusion/exclusion (e.g., first-time vs repeat riders, specific cities, iOS/Android versions).
  - State how you will handle riders with multiple requests during the experiment window.
- Primary metric and guardrails
  - Select a single primary metric.
  - Propose 2–3 guardrails (e.g., driver acceptance rate, average wait time, surge incidence).
  - Define each metric precisely: numerator, denominator, and measurement window.
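As an illustration of a precise numerator/denominator definition, the sketch below computes a per-arm completion rate: completed requests divided by eligible requests. It is only one possible primary metric, not a prescribed answer. It assumes one summary record per request, reuses the field names `request_id`, `treatment_arm`, and `trip_completed` from the instrumentation list in this prompt, and drops duplicate `request_id` rows; the function name and the sample records are hypothetical.

```python
from collections import defaultdict

def completion_rate_by_arm(records):
    """Completed requests / eligible requests, per arm, counting each request_id once."""
    seen = set()
    requests = defaultdict(int)   # denominator: eligible requests per arm
    completed = defaultdict(int)  # numerator: requests that ended in a completed trip
    for rec in records:
        rid = rec["request_id"]
        if rid in seen:           # assumption: a repeated request_id is a duplicate log row
            continue
        seen.add(rid)
        arm = rec["treatment_arm"]
        requests[arm] += 1
        if rec["trip_completed"]:
            completed[arm] += 1
    return {arm: completed[arm] / requests[arm] for arm in requests}

# Hypothetical sample data, for illustration only.
sample_records = [
    {"request_id": "r1", "treatment_arm": "control", "trip_completed": True},
    {"request_id": "r2", "treatment_arm": "control", "trip_completed": False},
    {"request_id": "r2", "treatment_arm": "control", "trip_completed": False},  # duplicate
    {"request_id": "r3", "treatment_arm": "treatment", "trip_completed": True},
]
rates = completion_rate_by_arm(sample_records)
print(rates)
```

A full definition would also fix the measurement window, for example by only counting completions whose timestamp falls within a stated interval after the request.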
- Instrumentation plan
  - List exact events/fields to log, e.g., request_id, rider_id, device_id, city_id, treatment_arm, event_name, timestamp_ms, cancel_reason, trip_started, trip_completed, driver_supply_at_request, ETA_shown.
  - Include sampling rate, idempotency strategy, and assignment audit plan (including SRM checks).
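One way to make assignment auditable is to derive the arm deterministically from a hash of the unit ID, then test the observed arm counts for sample ratio mismatch (SRM). The sketch below is a minimal example under assumptions: the names `assign_arm` and `srm_p_value` and the salt `eta_card_v1` are hypothetical, and the unit ID could be a rider, request, or cluster ID depending on the chosen randomization unit. The SRM check is a chi-square goodness-of-fit test with one degree of freedom, whose p-value equals `erfc(sqrt(chi2 / 2))`.

```python
import hashlib
import math

def assign_arm(unit_id: str, salt: str = "eta_card_v1", treat_share: float = 0.5) -> str:
    """Deterministic assignment: the same unit_id and salt always map to the same arm."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # approximately uniform in [0, 1)
    return "treatment" if bucket < treat_share else "control"

def srm_p_value(n_control: int, n_treatment: int, treat_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the observed split."""
    total = n_control + n_treatment
    expected_treatment = total * treat_share
    expected_control = total - expected_treatment
    chi2 = ((n_treatment - expected_treatment) ** 2 / expected_treatment
            + (n_control - expected_control) ** 2 / expected_control)
    return math.erfc(math.sqrt(chi2 / 2.0))

print(assign_arm("rider_123"))
print(srm_p_value(50_000, 50_000))  # a perfectly balanced split
print(srm_p_value(50_000, 51_500))  # an imbalanced split
```

Because the arm is a pure function of the ID and salt, re-deriving it offline from logged IDs gives an independent audit of the logged `treatment_arm`. A very small SRM p-value (a strict threshold such as 0.001 is a common choice) suggests an assignment or logging bug rather than a treatment effect.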
- Sample size and duration
  - Assume: baseline booking-to-trip-completion rate p0 = 0.120; minimum detectable relative lift = +5%; two-sided α = 0.05; power = 0.80; 1:1 split; 200,000 eligible requests per day (stable).
  - Compute the per-arm sample size for a two-sample difference-in-proportions test and the expected experiment duration (days) to reach it, accounting for 3% data loss.
  - Show the sample size math clearly.
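The sample size arithmetic can be sketched as follows. This uses the common unpooled-variance approximation for a two-sample proportions test, n = (z_{1-α/2} + z_{power})² · [p0(1−p0) + p1(1−p1)] / (p1 − p0)², with p1 = p0 · 1.05. It further assumes that requests are independent, that all eligible traffic is enrolled, and that 3% data loss means dividing the required n by 0.97; other formula variants (for example pooled variance) give slightly different numbers. The function and variable names are hypothetical.

```python
import math
from statistics import NormalDist

def per_arm_sample_size(p0: float, rel_lift: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm n for a two-sided, two-sample difference-in-proportions test (unpooled variance)."""
    p1 = p0 * (1.0 + rel_lift)                         # treatment rate under the alternative
    delta = p1 - p0                                    # absolute minimum detectable effect
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)               # about 0.84 for power = 0.80
    variance_sum = p0 * (1.0 - p0) + p1 * (1.0 - p1)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / delta ** 2)

n_per_arm = per_arm_sample_size(p0=0.120, rel_lift=0.05)
n_per_arm_logged = math.ceil(n_per_arm / (1.0 - 0.03))  # inflate for 3% data loss
requests_per_arm_per_day = 200_000 / 2                   # 1:1 split of all eligible requests
days_needed = math.ceil(n_per_arm_logged / requests_per_arm_per_day)

print(n_per_arm, n_per_arm_logged, days_needed)
```

Under these assumptions the requirement is roughly 47,000 requests per arm before the data-loss adjustment, which full traffic supplies in under a day. A real design would likely still run for at least one full weekly cycle, and rider-level or cluster-level randomization would inflate the requirement because requests within a unit are correlated.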
- Ramp strategy
  - Propose a safe ramp (e.g., 1% → 10% → 50% → 100%), pre-specified stopping rules, and monitoring (including SRM and guardrails) at each ramp.
- Bias/variance controls
  - Describe controls for seasonality, city heterogeneity (e.g., stratification or CUPED with pre-period completion), and bot/abuse filtering.
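For reference, the CUPED adjustment mentioned above can be sketched in a few lines. Each unit's in-experiment outcome Y is adjusted by its pre-period covariate X as Y_adj = Y − θ · (X − mean(X)), with θ = cov(X, Y) / var(X), which leaves the mean unchanged while reducing variance when X and Y are correlated. The function name and the sample values are hypothetical, and in practice θ would be estimated on data pooled across both arms.

```python
from statistics import fmean, pvariance

def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes using pre-period covariate x."""
    mean_x, mean_y = fmean(x), fmean(y)
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    if var_x == 0:
        return list(y)  # a constant covariate carries no information; leave y unchanged
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    theta = cov_xy / var_x
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]

# Hypothetical per-unit completion rates: pre-period (x) and experiment period (y).
pre_period = [0.10, 0.20, 0.15, 0.05, 0.30]
in_experiment = [0.12, 0.22, 0.14, 0.07, 0.28]

adjusted = cuped_adjust(in_experiment, pre_period)
print(fmean(in_experiment), fmean(adjusted))          # means agree
print(pvariance(in_experiment), pvariance(adjusted))  # adjusted variance is smaller
```

The covariate must be measured entirely before assignment, so that the treatment cannot influence it.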
- Success criteria and rollout
  - Pre-register decision thresholds and actions if the primary metric improves but a guardrail degrades.
State any additional assumptions you need.