Design and Analyze Airbnb Locker Experiment
Company: Airbnb
Role: Data Scientist
Category: Analytics & Experimentation
Difficulty: medium
Interview Round: Technical Screen
Airbnb is considering launching a **luggage locker feature** that lets guests store their bags before their scheduled check-in time, so they don't have to wait until the room is ready. You are the data scientist asked to design — and then analyze — an experiment that decides whether Airbnb should launch this feature.
The interview has two halves: a **case-study design** discussion (Parts 1–5, 7) and a **hands-on analysis** of a provided experiment dataset (Part 6).
### Constraints & Assumptions
- Airbnb is a **two-sided marketplace**: an intervention can affect guests, hosts, and the platform's revenue simultaneously, and these interests can conflict.
- Lockers are (at least partly) **shared physical infrastructure** with finite capacity in a given location.
- Outcomes span the full booking lifecycle: assignment may happen at booking time, but key outcomes (post-stay rating, support contacts, damage claims) only mature **after checkout**.
- Assume standard significance and power targets unless you argue otherwise: $\alpha = 0.05$ (two-sided) and $80\%$ power.
- You have access to historical baselines (e.g., support-contact rate, booking conversion) to power the test.
### Clarifying Questions to Ask
- Is the locker offered **before booking** (a search/booking feature) or **after booking** (a check-in convenience)? This determines the primary metric and randomization point.
- Are lockers **on-listing** (host-operated, e.g., a smart lock box) or **shared off-site** infrastructure with pooled capacity? This determines whether the stable unit treatment value assumption (SUTVA) can hold at the reservation level.
- What is the intended **launch mechanism** — a UX toggle, a host opt-in, or a market-wide rollout? The experiment should mirror how it will actually ship.
- Is the feature **free or priced**? Pricing changes both the metric set and the revenue tradeoff.
- What baseline rates do we have for the candidate primary metric, and what effect size would be **worth launching**?
- Are there **safety, liability, or insurance** constraints (lost/damaged luggage) that gate launch regardless of the metric outcome?
### Part 1 — Randomization unit
Should the experiment be randomized on the **guest** side, **reservation** side, **host/listing** side, or **market** side? Justify your choice, and explain how the answer depends on what exactly the treatment changes.
```hint What breaks independence?
The right unit is the smallest one for which the Stable Unit Treatment Value Assumption (SUTVA) still holds — i.e., one unit's treatment doesn't affect another unit's outcome. Ask: can a treated unit *contaminate* a control unit? Two channels do that here — shared **physical capacity** and the **host changing behavior** for everyone at a listing.
```
```hint Tie the unit to the mechanism
Map each candidate unit to the intervention it cleanly isolates: guest/reservation isolates *demand and UX* (does showing the option change behavior?); listing/host isolates a *host-operated rollout*; market/building isolates *shared-capacity* effects. If you cluster, you'll later pay for it in sample size and inference.
```
#### What This Part Should Cover
- Reasons from **SUTVA**, not from a default ("just randomize users").
- Identifies both contamination channels — **shared locker capacity** and **host-level behavior change** — and which unit internalizes each.
- Connects the unit choice to the **launch mechanism** and acknowledges the **power vs. validity** tradeoff (a finer unit gives more power but a higher risk of interference between arms).
### Part 2 — Eligibility, treatment, control
Define the **eligible population**, the **treatment** experience, and the **control** experience precisely enough that an engineer could implement the assignment logic.
```hint
Eligibility should be defined *before* and *independently* of treatment (never on a post-assignment behavior like "guests who clicked the locker"). Think about who can even be offered a locker (geography, listing type, arrival-before-check-in likelihood) and whether control is "nothing" or "today's host-managed luggage workaround."
```
#### What This Part Should Cover
- Eligibility defined **pre-assignment** and independent of any post-treatment behavior.
- Treatment defined as the **offer of locker access**, not actual usage.
- Control defined as the **realistic status quo** (current host-managed workaround), not a strawman.
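
As an illustration of "precise enough to implement," here is a minimal sketch of the assignment logic, assuming reservation-level randomization and a deterministic hash so the same unit always lands in the same arm. The field names `market_has_lockers`, `expected_arrival_hour`, and `checkin_hour`, the salt string, and both function names are hypothetical, not part of any real Airbnb system.

```python
import hashlib


def is_eligible(reservation: dict) -> bool:
    """Eligibility uses only fields known before assignment (hypothetical names)."""
    return bool(
        reservation["market_has_lockers"]
        and reservation["expected_arrival_hour"] < reservation["checkin_hour"]
    )


def assign_variant(unit_id: str, salt: str = "locker_exp_v1",
                   treat_share: float = 0.5) -> str:
    """Deterministically map a randomization unit to 'treatment' or 'control'."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 16**8  # roughly uniform in [0, 1)
    return "treatment" if bucket < treat_share else "control"


# Illustrative reservation; every value here is made up.
reservation = {"unit_id": "res_123", "market_has_lockers": True,
               "expected_arrival_hour": 10, "checkin_hour": 15}
if is_eligible(reservation):
    # treatment = locker offer shown; control = today's status quo
    print(assign_variant(reservation["unit_id"]))
```

If the randomization unit is a listing or a market rather than a reservation, the same function could be called with that cluster's id, so every reservation in the cluster inherits one variant.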
### Part 3 — Metrics
Choose a **primary** metric, a set of **secondary** metrics, and **guardrail** metrics. Discuss the tradeoffs among guest experience, host experience, operational cost, safety, and marketplace revenue.
```hint Pick primary by where the feature lives in the funnel
A pre-booking feature argues for a conversion/GBV primary; a post-booking convenience feature argues for a satisfaction or support-load primary. One sentence to anchor the choice: the primary should be sensitive to the feature, movable within a realistic test window, and aligned with the launch goal.
```
```hint Guardrails protect the side you're *not* optimizing
Because this is a marketplace, list guardrails on the host/safety/cost side (cancellations, damage/lost-item claims, support cost per reservation, locker over-capacity) so a guest win that quietly burns hosts or trust-and-safety gets caught.
```
#### What This Part Should Cover
- A **single, well-justified primary** tied to the feature's funnel position.
- A **coherent metric tree** (primary / secondary / guardrail), not a flat list of everything measurable.
- **Explicit marketplace tradeoffs** — names the guardrails that protect hosts, safety, and cost, and states a launch decision rule.
### Part 4 — Sample size and duration
Explain how you would compute the **required sample size** and decide the **experiment duration**.
```hint Sample size
For a binary primary (e.g., support-contact rate), use a two-proportion power calculation driven by a *pre-registered* minimum detectable effect (MDE). If you randomize at the cluster level (listing/market) rather than the reservation level, inflate $n$ by the design effect $\text{DEFF} = 1 + (\bar{m}-1)\rho$, where $\bar{m}$ is the average cluster size and $\rho$ is the intraclass correlation.
```
```hint Duration is not just n / traffic
Duration must also cover weekly seasonality (weekday/weekend travel) **and** the booking-to-checkout lag, since post-stay outcomes only mature after guests leave. Low-traffic markets may gate the timeline more than the math does.
```
#### What This Part Should Cover
- Sample size driven by a **pre-registered MDE** and the correct **two-proportion power formula** (for a binary primary).
- **Design-effect inflation** when randomizing at the cluster level, with the right intuition for $\rho$ and cluster size.
- Duration reasoning that goes **beyond $n/\text{traffic}$**: weekly seasonality, the booking-to-checkout maturation lag, and thin-market constraints.
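
The calculation can be sketched as follows, using the standard two-proportion sample-size formula, the design effect $1 + (\bar{m}-1)\rho$, and a duration that adds the maturation lag to the enrollment window. All inputs (10% baseline, a one-point absolute MDE, cluster size 20, $\rho = 0.05$, 5,000 eligible reservations per week, a two-week lag) are made-up illustrations, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_arm(p_control, mde_abs, alpha=0.05, power=0.80):
    """Per-arm n for a two-sided two-proportion test with absolute MDE."""
    p_treat = p_control + mde_abs
    p_bar = (p_control + p_treat) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_control * (1 - p_control) + p_treat * (1 - p_treat))
    ) ** 2
    return math.ceil(numerator / mde_abs ** 2)


def design_effect(avg_cluster_size, icc):
    """Variance inflation from cluster randomization: 1 + (m - 1) * rho."""
    return 1 + (avg_cluster_size - 1) * icc


def weeks_needed(n_total, weekly_eligible_units, maturation_weeks):
    """Whole enrollment weeks plus the booking-to-checkout maturation lag."""
    return math.ceil(n_total / weekly_eligible_units) + maturation_weeks


# Illustrative inputs only; real values come from historical baselines.
n_simple = n_per_arm(p_control=0.10, mde_abs=-0.01)
n_clustered = math.ceil(n_simple * design_effect(avg_cluster_size=20, icc=0.05))
print(n_simple, n_clustered)
print(weeks_needed(2 * n_clustered, weekly_eligible_units=5000, maturation_weeks=2))
```

Rounding enrollment up to whole weeks keeps weekday and weekend travel equally represented in both arms.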
### Part 5 — Analysis plan
Describe how you would **estimate the treatment effect** and **compute a p-value** for the primary metric, including how you'd reduce variance and handle clustered randomization.
```hint Estimand first, test second
Default to **intent-to-treat** (compare by *assignment*, not by who actually used the locker) so self-selection doesn't bias you. Then pick a test matched to the metric type (two-proportion z / logistic for binary; t-test / regression / bootstrap for continuous), and report the **CI and effect size**, not just $p$. Variance reduction: covariate adjustment / CUPED on pre-period behavior; if clustered, cluster-robust SEs at the randomization level.
```
#### What This Part Should Cover
- States the **ITT estimand** explicitly and never conditions on post-treatment usage.
- Matches the **test to the metric type** and reports a **CI / effect size**, not just a p-value.
- **Variance reduction** (covariate adjustment / CUPED) and **cluster-robust inference** matched to the actual randomization level.
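
A minimal sketch of the ITT estimate for a binary primary, plus a CUPED adjustment for a continuous metric. It assumes reservation-level randomization with independent units; under cluster randomization these standard errors would understate uncertainty, and cluster-robust inference would be needed instead. The function names and the example counts are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def two_proportion_ztest(x_t, n_t, x_c, n_c, alpha=0.05):
    """ITT difference in proportions (treatment minus control), z, p, and CI."""
    p_t, p_c = x_t / n_t, x_c / n_c
    diff = p_t - p_c
    # Pooled SE for the test statistic (variance under the null of no effect).
    p_pool = (x_t + x_c) / (n_t + n_c)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
    z = diff / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled SE for the confidence interval around the observed difference.
    se = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return diff, z, p_value, (diff - z_crit * se, diff + z_crit * se)


def cuped_adjust(y, x):
    """Return y - theta * (x - mean(x)), with theta = cov(y, x) / var(x).

    x must be a pre-assignment covariate; theta is estimated on both arms pooled.
    """
    mx, my = mean(x), mean(y)
    var_x = sum((xi - mx) ** 2 for xi in x)
    if var_x == 0:
        return list(y)
    theta = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / var_x
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]


# Made-up counts: support contacts out of assigned reservations per arm.
diff, z, p_value, ci = two_proportion_ztest(x_t=900, n_t=10000, x_c=1000, n_c=10000)
print(f"diff={diff:.4f} z={z:.2f} p={p_value:.4f} ci=({ci[0]:.4f}, {ci[1]:.4f})")
```

The adjusted values from `cuped_adjust` keep the same overall mean as the raw metric, so the treatment-minus-control difference estimates the same ITT effect with lower variance whenever the covariate is correlated with the outcome.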
### Part 6 — Hands-on A/B and A/A test
You are given a **reservation-level experiment table** with these columns:
| column | type | meaning |
|---|---|---|
| `unit_id` | string | randomization unit id |
| `guest_id`, `host_id`, `listing_id` | string | identity / clustering keys |
| `market` | string | market / city |
| `assignment_variant` | string | `treatment` or `control` |
| `assigned_at` | timestamp (UTC) | when the unit was assigned |
| `checkin_at`, `checkout_at` | timestamp (UTC) | stay window |
| `booked` | int | 1 if a booking was made |
| `used_locker` | int | 1 if the guest actually used a locker (post-treatment) |
| `support_contacted` | int | 1 if guest contacted support |
| `post_stay_rating` | float | rating after stay |
| `gross_booking_value` | float | GBV in currency units |
| `pre_assignment_trips` | int | prior trips (pre-period covariate) |
| `pre_assignment_support_rate` | float | prior support rate (pre-period covariate) |
| `is_aa_test` | bool | TRUE if the row belongs to an A/A holdout |
Explain, concretely, how you would (a) run the **A/B test** to estimate the effect and p-value on this table, and (b) run an **A/A test** to detect bias, sample-ratio mismatch, or instrumentation issues. Reference the specific columns you'd use.
```hint A/B
Filter to `is_aa_test == False`, validate one row per `unit_id` and that `assigned_at` precedes outcomes, then run the test on `support_contacted` (or your chosen primary). `used_locker` is **post-treatment** — never condition the main estimate on it.
```
```hint A/A
The A/A rows (`is_aa_test == True`) should show *no* real effect. Two checks: a **sample-ratio-mismatch** test (chi-square of variant counts vs. expected split) and **covariate balance** on `pre_assignment_trips` / `pre_assignment_support_rate` / `market`. In large samples, prefer **standardized mean differences** over raw p-values for balance — tiny imbalances become "significant."
```
#### What This Part Should Cover
- A **column-specific, runnable** A/B procedure: filter on `is_aa_test`, validate `unit_id` uniqueness and that `assigned_at` precedes outcomes, estimate ITT on the chosen primary.
- Treats `used_locker` strictly as **post-treatment** (a diagnostic, never a filter or conditioning variable for the headline).
- An A/A that runs the **SRM (chi-square)** check and **covariate balance** via standardized mean differences, and explains what a failure implies for trusting the live A/B.
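
The validation and A/A checks can be sketched as below, assuming the table has been loaded as a list of dicts keyed by the column names above and that the intended split is 50/50. "Assignment precedes outcomes" is approximated here as `assigned_at` before `checkout_at`, which is an assumption; the function names are hypothetical.

```python
import math
from statistics import mean, variance


def validate_and_split(rows):
    """Check one row per unit_id and assignment before checkout; split A/B vs A/A."""
    seen = set()
    for r in rows:
        if r["unit_id"] in seen:
            raise ValueError(f"duplicate unit_id: {r['unit_id']}")
        seen.add(r["unit_id"])
        if r["assigned_at"] >= r["checkout_at"]:
            raise ValueError(f"assignment not before checkout: {r['unit_id']}")
    ab_rows = [r for r in rows if not r["is_aa_test"]]
    aa_rows = [r for r in rows if r["is_aa_test"]]
    return ab_rows, aa_rows


def srm_check(n_treatment, n_control, expected_treat_share=0.5):
    """Chi-square goodness-of-fit (1 df) of variant counts vs. the expected split."""
    total = n_treatment + n_control
    exp_t = total * expected_treat_share
    exp_c = total - exp_t
    chi2 = (n_treatment - exp_t) ** 2 / exp_t + (n_control - exp_c) ** 2 / exp_c
    p_value = math.erfc(math.sqrt(chi2 / 2))  # exact survival function for 1 df
    return chi2, p_value


def standardized_mean_diff(treat_values, control_values):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt((variance(treat_values) + variance(control_values)) / 2)
    if pooled_sd == 0:
        return 0.0
    return (mean(treat_values) - mean(control_values)) / pooled_sd


def aa_report(aa_rows, covariates=("pre_assignment_trips",
                                   "pre_assignment_support_rate")):
    """SRM test plus covariate balance on the A/A rows."""
    treat = [r for r in aa_rows if r["assignment_variant"] == "treatment"]
    control = [r for r in aa_rows if r["assignment_variant"] == "control"]
    chi2, p_value = srm_check(len(treat), len(control))
    smd = {c: standardized_mean_diff([r[c] for r in treat],
                                     [r[c] for r in control]) for c in covariates}
    return {"srm_chi2": chi2, "srm_p": p_value, "smd": smd}
```

A common rule of thumb treats an absolute standardized mean difference below about 0.1 as acceptable balance. `market` is categorical, so its balance would be checked per level (share of each market by variant) rather than with a single mean difference.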
### Part 7 — Risks
Discuss the risks that threaten this specific experiment: **selection bias**, **interference** between treatment and control, **locker capacity constraints**, **host-side spillovers**, and **heterogeneous effects** across markets. For each, name how your design or analysis mitigates it.
```hint
Each risk maps to a design lever you already chose: selection bias → ITT; interference/capacity → choice of randomization unit (cluster) + capacity-aware analysis; host spillovers → listing/host-level unit; heterogeneity → *pre-specified* market stratification (not subgroup fishing).
```
#### What This Part Should Cover
- A **risk-to-mitigation mapping** (each risk paired with the specific design or analysis lever), not a generic risk list.
- Recognizes that the mitigations are the **same levers chosen earlier** (unit choice, ITT, stratification), showing the design hangs together.
- Calls out **heterogeneity discipline** — pre-specified stratification, not post-hoc subgroup fishing.
### What a Strong Answer Covers
These dimensions span all parts and tie the case together:
- **Separates the three problems**: the product decision, the randomization design, and the statistical analysis — and doesn't conflate "guests like it" with "the marketplace is better off."
- Treats the **two-sided marketplace** as a first-class constraint throughout (guest wins are gated by host, safety, cost, and capacity guardrails).
- **Consistency across parts**: the randomization unit, estimand, metrics, power math, and risk mitigations all reference each other rather than contradicting.
- **Pre-registration discipline**: MDE, primary metric, and any subgroup analyses are fixed before looking at data.
### Follow-up Questions
- The treatment effect on the primary metric is positive and significant, but locker **over-capacity rate** breaches its guardrail in two high-volume markets. What do you recommend, and what's the next experiment?
- The A/A test shows a **2-percentage-point sample-ratio mismatch** (e.g., 51/49 instead of 50/50) that is statistically significant. Walk through how you'd diagnose whether it's a logging artifact or a real assignment bug, and whether you'd trust the A/B result.
- How would you estimate the **treatment-on-treated (TOT)** effect for guests who actually used a locker, and what assumption does using assignment as an instrument require?
- Suppose the effect is strongly **heterogeneous** — large in dense urban markets, negative in rural ones. How does that change the launch decision and the rollout plan?
Quick Answer: This question evaluates a candidate's skills in experimental design and causal inference for marketplace features, including selecting the appropriate randomization unit, defining eligibility and treatment/control arms, and choosing primary, secondary, and guardrail metrics.