Reserving an Elevator for Food Deliveries
Company: Amazon
Role: Data Scientist
Category: Analytics & Experimentation
Difficulty: medium
Interview Round: Technical Screen
## Reserving an Elevator for Food Deliveries
You are a data scientist working with a building-operations company that manages large residential apartment towers. A building owner has noticed that a large and growing share of elevator trips are made by **food-delivery couriers** entering the building to drop off orders. These courier trips are short, frequent, and often coincide with peak meal times, and the owner believes they crowd out residents and increase resident wait times.
The owner proposes an intervention: in a tower with **6 elevators**, **reserve one elevator exclusively for food-delivery couriers**, leaving residents 5 general-use elevators. Couriers would be routed (via signage and the delivery-app drop-off instructions) to use only the reserved car.
Design an A/B test to evaluate this policy, and define how you would measure success.
```hint Where to start
The owner's stated goal (reduce resident wait time / crowding) and the obvious cost (residents lose a sixth of general capacity; couriers may be funneled into one slow car) point in opposite directions. Name the **primary metric** that captures the goal and the **guardrail metrics** that capture the costs before you touch the design.
```
```hint Unit of randomization
The treatment is a physical, building-level change — you cannot randomize individual residents within one tower (everyone shares the same 6 elevators). Think about what the smallest independently-treatable unit actually is, and what that implies for sample size and analysis (clustered / switchback designs).
```
```hint Measuring "success" for two populations
There are two distinct populations with potentially opposing outcomes: **residents** and **couriers**. A win for one can be a loss for the other. Decide whose experience is the primary success metric and whose is a guardrail, and define a metric you can actually instrument (e.g. wait time from button-press to car arrival).
```
### Constraints & Assumptions
- A tower has 6 elevators; the policy converts 1 of them into a courier-only car (residents keep 5).
- The company operates a portfolio of many similar towers (tens to low hundreds), which is what makes a between-building experiment feasible.
- Elevator controllers log every **call** (button press, originating floor, timestamp) and every **arrival** (car, floor, timestamp), so per-trip wait time is instrumentable.
- Food-delivery couriers can be (imperfectly) identified — e.g. via building-entry QR check-in used by delivery apps, or by trip patterns (lobby-to-upper-floor-and-back in a short window).
- Compliance is imperfect: some couriers will still take general elevators, and some residents may take the courier car when it is idle.
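
To make the "per-trip wait time is instrumentable" assumption concrete, here is a minimal sketch of deriving wait times from the call and arrival logs. The record layouts, the function name `wait_times`, and the sample rows are hypothetical, not the controller's actual schema. The matching rule (first arrival at the same floor at or after the call) is a simplification that ignores travel direction and car assignment.

```python
from bisect import bisect_left
from collections import defaultdict


def wait_times(calls, arrivals):
    """Match each call to the first arrival at the same floor at or after it.

    calls:    iterable of (floor, call_ts), timestamps in seconds (assumed layout)
    arrivals: iterable of (car, floor, arrival_ts) (assumed layout)
    Returns a list of (floor, call_ts, wait_seconds); unmatched calls are dropped.
    """
    # Index arrival timestamps by floor and sort them for binary search.
    by_floor = defaultdict(list)
    for _car, floor, ts in arrivals:
        by_floor[floor].append(ts)
    for ts_list in by_floor.values():
        ts_list.sort()

    out = []
    for floor, call_ts in calls:
        ts_list = by_floor.get(floor, [])
        i = bisect_left(ts_list, call_ts)  # first arrival with ts >= call_ts
        if i < len(ts_list):
            out.append((floor, call_ts, ts_list[i] - call_ts))
    return out


# Hypothetical sample logs, for illustration only.
calls = [(1, 0), (1, 10), (7, 20), (7, 100)]
arrivals = [("A", 1, 25), ("B", 7, 50), ("A", 1, 90)]
result = wait_times(calls, arrivals)
```

Two calls on the same floor before a car arrives are both matched to that arrival, which mirrors two people waiting for the same car. A production version would likely also need to split calls into resident and courier trips using the (imperfect) courier identification described above.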
### Clarifying Questions to Ask
- What is the precise primary objective — reduce **resident** wait time, reduce overall crowding/complaints, improve courier throughput, or something else? Whose experience defines "success"?
- How many towers are available, and how similar are they in size, floor count, resident count, and delivery volume? Is there a pre-period of logged data for each?
- How are couriers identified at the point of an elevator call, and how reliable is that identification? Is compliance enforced or voluntary?
- What is the minimum effect size the owner would consider worth the operational cost, and what wait-time degradation for couriers (or residents) would make the policy unacceptable?
- Are there strong time-of-day and day-of-week patterns (meal-time peaks) that the design must account for?
- Is there a risk of spillover or novelty (residents/couriers behaving differently just because the policy is new)?
### What a Strong Answer Covers
- **Metric design**: a clearly named primary metric (e.g. resident wait time per call), guardrail metrics (courier wait time, overall throughput, resident complaints/NPS), and counter-metrics that catch harm.
- **Unit of randomization**: recognition that this is a cluster-level (building-level) treatment, not a user-level one, and the consequences for power.
- **Design choice**: a defensible design — cluster-randomized across towers, or a within-building **switchback** (time-sliced on/off) to control for between-building heterogeneity — with the trade-offs of each.
- **Sample size / power & MDE**: how the cluster count or switchback period length limits the minimum detectable effect, and how the analysis accounts for within-cluster correlation (observations from the same tower are not independent).
- **Confounders & validity threats**: meal-time seasonality, compliance/non-adherence, spillover, novelty/primacy effects, and how the analysis handles them (e.g. intent-to-treat).
- **Heterogeneity**: peak vs off-peak, tall vs short buildings, high- vs low-delivery towers.
- **Decision rule**: what combination of primary lift and guardrail constraints would lead to a ship / no-ship / iterate decision.
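
The power point above can be illustrated with a rough calculation. This is a sketch under stated assumptions, not a full power analysis: it uses the standard design-effect formula `1 + (m - 1) * ICC` for a two-arm cluster-randomized test with equal-sized clusters, and a normal approximation that is optimistic when the number of towers is small. The function name `cluster_mde` and all input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def cluster_mde(k_per_arm, m_per_cluster, sigma, icc, alpha=0.05, power=0.8):
    """Approximate minimum detectable difference in mean wait time.

    k_per_arm:     towers per arm
    m_per_cluster: elevator calls observed per tower
    sigma:         standard deviation of per-call wait time (seconds)
    icc:           intra-cluster correlation of wait times within a tower
    """
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # Variance inflation from calls in the same tower being correlated.
    design_effect = 1 + (m_per_cluster - 1) * icc
    se_diff = sqrt(2 * sigma**2 * design_effect / (k_per_arm * m_per_cluster))
    return z_total * se_diff


# Hypothetical inputs: sigma = 30 s, ICC = 0.05, 10,000 calls per tower.
mde_few_towers = cluster_mde(k_per_arm=4, m_per_cluster=10_000, sigma=30.0, icc=0.05)
mde_many_towers = cluster_mde(k_per_arm=40, m_per_cluster=10_000, sigma=30.0, icc=0.05)
mde_more_calls = cluster_mde(k_per_arm=4, m_per_cluster=100_000, sigma=30.0, icc=0.05)
```

With a nonzero ICC, adding towers shrinks the MDE much more than logging more calls per tower, which is why the number of clusters, not the number of trips, is the binding constraint in a between-building design.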
### Follow-up Questions
- The owner only has 8 towers available and meal-time peaks dominate the signal. Would you still run a between-building test, or switch to a switchback design? Walk through the trade-offs and how you'd analyze each.
- Compliance is only ~60% (many couriers ignore the reserved car). How does non-adherence bias your estimate, and how would you report intent-to-treat vs. treatment-on-the-treated effects?
- Suppose resident wait time improves but courier delivery times get much worse, raising delivery-app complaints that hurt the building's reputation. How do you frame the trade-off and recommend a decision?
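
For the compliance follow-up, one way to show the gap between the two estimands is the simple Wald rescaling: intent-to-treat (ITT) is the difference in mean outcomes by assignment, and treatment-on-the-treated (TOT) divides that by the difference in uptake. This is a minimal sketch with hypothetical tower-level numbers and a hypothetical function name. The rescaling assumes assignment affects resident wait time only through couriers actually using the reserved car, which is doubtful here because residents lose a general-use car regardless of courier compliance, so the TOT figure should be read with caution.

```python
from statistics import mean


def itt_and_tot(treated_outcomes, control_outcomes,
                compliance_treated, compliance_control=0.0):
    """Return (ITT, TOT) for mean outcomes grouped by assignment.

    compliance_*: share of courier trips using the reserved car in each arm
                  (zero in control if no car is reserved there).
    """
    itt = mean(treated_outcomes) - mean(control_outcomes)
    uptake_gap = compliance_treated - compliance_control
    if uptake_gap <= 0:
        raise ValueError("treatment arm must have higher uptake than control")
    return itt, itt / uptake_gap


# Hypothetical mean resident wait times (seconds) per tower.
treated = [40.0, 44.0, 42.0, 38.0]
control = [45.0, 47.0, 46.0, 46.0]
itt, tot = itt_and_tot(treated, control, compliance_treated=0.6)
```

The ITT answers "what happens if the owner rolls out the policy as it would really be followed", which is usually the number the ship decision should rest on. The TOT is larger in magnitude by construction and mainly indicates how much headroom better enforcement might unlock.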
Quick Answer: This question tests a data scientist's ability to design a rigorous A/B experiment for a real-world operational policy with competing stakeholder outcomes. It evaluates expertise in experiment design concepts including cluster randomization, switchback designs, metric selection, guardrail metrics, and intent-to-treat analysis — core skills assessed in analytics and experimentation interviews.