PracHub

Design and Analyze Airbnb Locker Experiment

Last updated: Jun 21, 2026

Quick Overview

This question evaluates a candidate's skills in experimental design and causal inference for marketplace features, including selecting the appropriate randomization unit, defining eligibility and treatment/control arms, and choosing primary, secondary, and guardrail metrics.

Company: Airbnb

Role: Data Scientist

Category: Analytics & Experimentation

Difficulty: medium

Interview Round: Technical Screen

Airbnb is considering launching a **luggage locker feature** that lets guests store their bags before their scheduled check-in time, so they don't have to wait until the room is ready. You are the data scientist asked to design, and then analyze, an experiment that decides whether Airbnb should launch this feature.

The interview has two halves: a **case-study design** discussion (Parts 1–5, 7) and a **hands-on analysis** of a provided experiment dataset (Part 6).

### Constraints & Assumptions

- Airbnb is a **two-sided marketplace**: an intervention can affect guests, hosts, and the platform's revenue simultaneously, and these interests can conflict.
- Lockers are (at least partly) **shared physical infrastructure** with finite capacity in a given location.
- Outcomes span the full booking lifecycle: assignment may happen at booking time, but key outcomes (post-stay rating, support contacts, damage claims) only mature **after checkout**.
- Assume standard significance and power targets unless you argue otherwise: $\alpha = 0.05$ (two-sided) and $80\%$ power.
- You have access to historical baselines (e.g., support-contact rate, booking conversion) to power the test.

### Clarifying Questions to Ask

- Is the locker offered **before booking** (a search/booking feature) or **after booking** (a check-in convenience)? This determines the primary metric and randomization point.
- Are lockers **on-listing** (host-operated, e.g., a smart lock box) or **shared off-site** infrastructure with pooled capacity? This determines whether SUTVA can hold at the reservation level.
- What is the intended **launch mechanism**: a UX toggle, a host opt-in, or a market-wide rollout? The experiment should mirror how it will actually ship.
- Is the feature **free or priced**? Pricing changes both the metric set and the revenue tradeoff.
- What baseline rates do we have for the candidate primary metric, and what effect size would be **worth launching**?
- Are there **safety, liability, or insurance** constraints (lost/damaged luggage) that gate launch regardless of the metric outcome?

### Part 1: Randomization unit

Should the experiment be randomized on the **guest** side, **reservation** side, **host/listing** side, or **market** side? Justify your choice, and explain how the answer depends on what exactly the treatment changes.

```hint What breaks independence?
The right unit is the smallest one for which the Stable Unit Treatment Value Assumption (SUTVA) still holds, i.e., one unit's treatment doesn't affect another unit's outcome. Ask: can a treated unit *contaminate* a control unit? Two channels do that here: shared **physical capacity** and the **host changing behavior** for everyone at a listing.
```

```hint Tie the unit to the mechanism
Map each candidate unit to the intervention it cleanly isolates: guest/reservation isolates *demand and UX* (does showing the option change behavior?); listing/host isolates a *host-operated rollout*; market/building isolates *shared-capacity* effects. If you cluster, you'll later pay for it in sample size and inference.
```

#### What This Part Should Cover

- Reasons from **SUTVA**, not from a default ("just randomize users").
- Identifies both contamination channels, **shared locker capacity** and **host-level behavior change**, and which unit internalizes each.
- Connects the unit choice to the **launch mechanism** and acknowledges the **power vs. validity** tradeoff (finer = more powerful but riskier).

### Part 2: Eligibility, treatment, control

Define the **eligible population**, the **treatment** experience, and the **control** experience precisely enough that an engineer could implement the assignment logic.

```hint
Eligibility should be defined *before* and *independently* of treatment (never on a post-assignment behavior like "guests who clicked the locker").
Think about who can even be offered a locker (geography, listing type, arrival-before-check-in likelihood) and whether control is "nothing" or "today's host-managed luggage workaround."
```

#### What This Part Should Cover

- Eligibility defined **pre-assignment** and independent of any post-treatment behavior.
- Treatment defined as the **offer of locker access**, not actual usage.
- Control defined as the **realistic status quo** (current host-managed workaround), not a strawman.

### Part 3: Metrics

Choose a **primary** metric, a set of **secondary** metrics, and **guardrail** metrics. Discuss the tradeoffs among guest experience, host experience, operational cost, safety, and marketplace revenue.

```hint Pick primary by where the feature lives in the funnel
A pre-booking feature argues for a conversion/GBV primary; a post-booking convenience feature argues for a satisfaction or support-load primary. One sentence to anchor the choice: the primary should be sensitive to the feature, move-able in a realistic test window, and aligned with the launch goal.
```

```hint Guardrails protect the side you're *not* optimizing
Because this is a marketplace, list guardrails on the host/safety/cost side (cancellations, damage/lost-item claims, support cost per reservation, locker over-capacity) so a guest win that quietly burns hosts or trust-and-safety gets caught.
```

#### What This Part Should Cover

- A **single, well-justified primary** tied to the feature's funnel position.
- A **coherent metric tree** (primary / secondary / guardrail), not a flat list of everything measurable.
- **Explicit marketplace tradeoffs**: names the guardrails that protect hosts, safety, and cost, and states a launch decision rule.

### Part 4: Sample size and duration

Explain how you would compute the **required sample size** and decide the **experiment duration**.

```hint Sample size
For a binary primary (e.g., support-contact rate), use a two-proportion power calculation driven by a *pre-registered* minimum detectable effect (MDE). If you randomize at a cluster (listing/market) rather than the reservation, inflate $n$ by the design effect $\text{DEFF} = 1 + (\bar{m}-1)\rho$, where $\bar{m}$ is average cluster size and $\rho$ is the intraclass correlation.
```

```hint Duration is not just n / traffic
Duration must also cover weekly seasonality (weekday/weekend travel) **and** the booking-to-checkout lag, since post-stay outcomes only mature after guests leave. Low-traffic markets may gate the timeline more than the math does.
```

#### What This Part Should Cover

- Sample size driven by a **pre-registered MDE** and the correct **two-proportion power formula** (for a binary primary).
- **Design-effect inflation** when randomizing at a cluster, with the right intuition for $\rho$ and cluster size.
- Duration reasoning that goes **beyond $n/\text{traffic}$**: weekly seasonality, the booking-to-checkout maturation lag, and thin-market constraints.

### Part 5: Analysis plan

Describe how you would **estimate the treatment effect** and **compute a p-value** for the primary metric, including how you'd reduce variance and handle clustered randomization.

```hint Estimand first, test second
Default to **intent-to-treat** (compare by *assignment*, not by who actually used the locker) so self-selection doesn't bias you. Then pick a test matched to the metric type (two-proportion z / logistic for binary; t-test / regression / bootstrap for continuous), and report the **CI and effect size**, not just $p$. Variance reduction: covariate adjustment / CUPED on pre-period behavior; if clustered, cluster-robust SEs at the randomization level.
```

#### What This Part Should Cover

- States the **ITT estimand** explicitly and never conditions on post-treatment usage.
- Matches the **test to the metric type** and reports a **CI / effect size**, not just a p-value.
- **Variance reduction** (covariate adjustment / CUPED) and **cluster-robust inference** matched to the actual randomization level.

### Part 6: Hands-on A/B and A/A test

You are given a **reservation-level experiment table** with these columns:

| column | type | meaning |
|---|---|---|
| `unit_id` | string | randomization unit id |
| `guest_id`, `host_id`, `listing_id` | string | identity / clustering keys |
| `market` | string | market / city |
| `assignment_variant` | string | `treatment` or `control` |
| `assigned_at` | timestamp (UTC) | when the unit was assigned |
| `checkin_at`, `checkout_at` | timestamp (UTC) | stay window |
| `booked` | int | 1 if a booking was made |
| `used_locker` | int | 1 if the guest actually used a locker (post-treatment) |
| `support_contacted` | int | 1 if guest contacted support |
| `post_stay_rating` | float | rating after stay |
| `gross_booking_value` | float | GBV in currency units |
| `pre_assignment_trips` | int | prior trips (pre-period covariate) |
| `pre_assignment_support_rate` | float | prior support rate (pre-period covariate) |
| `is_aa_test` | bool | TRUE if the row belongs to an A/A holdout |

Explain, concretely, how you would (a) run the **A/B test** to estimate the effect and p-value on this table, and (b) run an **A/A test** to detect bias, sample-ratio mismatch, or instrumentation issues. Reference the specific columns you'd use.

```hint A/B
Filter to `is_aa_test == False`, validate one row per `unit_id` and that `assigned_at` precedes outcomes, then run the test on `support_contacted` (or your chosen primary). `used_locker` is **post-treatment**: never condition the main estimate on it.
```

```hint A/A
The A/A rows (`is_aa_test == True`) should show *no* real effect. Two checks: a **sample-ratio-mismatch** test (chi-square of variant counts vs. expected split) and **covariate balance** on `pre_assignment_trips` / `pre_assignment_support_rate` / `market`. In large samples, prefer **standardized mean differences** over raw p-values for balance, because tiny imbalances become "significant."
```

#### What This Part Should Cover

- A **column-specific, runnable** A/B procedure: filter on `is_aa_test`, validate `unit_id` uniqueness and that `assigned_at` precedes outcomes, estimate ITT on the chosen primary.
- Treats `used_locker` strictly as **post-treatment** (a diagnostic, never a filter or conditioning variable for the headline).
- An A/A that runs the **SRM (chi-square)** check and **covariate balance** via standardized mean differences, and explains what a failure implies for trusting the live A/B.

### Part 7: Risks

Discuss the risks that threaten this specific experiment: **selection bias**, **interference** between treatment and control, **locker capacity constraints**, **host-side spillovers**, and **heterogeneous effects** across markets. For each, name how your design or analysis mitigates it.

```hint
Each risk maps to a design lever you already chose: selection bias → ITT; interference/capacity → choice of randomization unit (cluster) + capacity-aware analysis; host spillovers → listing/host-level unit; heterogeneity → *pre-specified* market stratification (not subgroup fishing).
```

#### What This Part Should Cover

- A **risk-to-mitigation mapping** (each risk paired with the specific design or analysis lever), not a generic risk list.
- Recognizes that the mitigations are the **same levers chosen earlier** (unit choice, ITT, stratification), showing the design hangs together.
- Calls out **heterogeneity discipline**: pre-specified stratification, not post-hoc subgroup fishing.

### What a Strong Answer Covers

These dimensions span all parts and tie the case together:

- **Separates the three problems**: the product decision, the randomization design, and the statistical analysis, and doesn't conflate "guests like it" with "the marketplace is better off."
- Treats the **two-sided marketplace** as a first-class constraint throughout (guest wins are gated by host, safety, cost, and capacity guardrails).
- **Consistency across parts**: the randomization unit, estimand, metrics, power math, and risk mitigations all reference each other rather than contradicting.
- **Pre-registration discipline**: MDE, primary metric, and any subgroup analyses are fixed before looking at data.

### Follow-up Questions

- The treatment effect on the primary metric is positive and significant, but locker **over-capacity rate** breaches its guardrail in two high-volume markets. What do you recommend, and what's the next experiment?
- The A/A test shows a **2% sample-ratio mismatch** (e.g., 51/49 instead of 50/50) that is statistically significant. Walk through how you'd diagnose whether it's a logging artifact or a real assignment bug, and whether you'd trust the A/B result.
- How would you estimate the **treatment-on-treated (TOT)** effect for guests who actually used a locker, and what assumption does using assignment as an instrument require?
- Suppose the effect is strongly **heterogeneous**: large in dense urban markets, negative in rural ones. How does that change the launch decision and the rollout plan?

Feb 21, 2026, 12:00 AM

Airbnb is considering launching a luggage locker feature that lets guests store their bags before their scheduled check-in time, so they don't have to wait until the room is ready. You are the data scientist asked to design — and then analyze — an experiment that decides whether Airbnb should launch this feature.

The interview has two halves: a case-study design discussion (Parts 1–5, 7) and a hands-on analysis of a provided experiment dataset (Part 6).

Constraints & Assumptions

  • Airbnb is a two-sided marketplace: an intervention can affect guests, hosts, and the platform's revenue simultaneously, and these interests can conflict.
  • Lockers are (at least partly) shared physical infrastructure with finite capacity in a given location.
  • Outcomes span the full booking lifecycle: assignment may happen at booking time, but key outcomes (post-stay rating, support contacts, damage claims) only mature after checkout.
  • Assume standard significance and power targets unless you argue otherwise: $\alpha = 0.05$ (two-sided) and $80\%$ power.
  • You have access to historical baselines (e.g., support-contact rate, booking conversion) to power the test.

Clarifying Questions to Ask

  • Is the locker offered before booking (a search/booking feature) or after booking (a check-in convenience)? This determines the primary metric and randomization point.
  • Are lockers on-listing (host-operated, e.g., a smart lock box) or shared off-site infrastructure with pooled capacity? This determines whether SUTVA can hold at the reservation level.
  • What is the intended launch mechanism — a UX toggle, a host opt-in, or a market-wide rollout? The experiment should mirror how it will actually ship.
  • Is the feature free or priced? Pricing changes both the metric set and the revenue tradeoff.
  • What baseline rates do we have for the candidate primary metric, and what effect size would be worth launching?
  • Are there safety, liability, or insurance constraints (lost/damaged luggage) that gate launch regardless of the metric outcome?

Part 1 — Randomization unit

Should the experiment be randomized on the guest side, reservation side, host/listing side, or market side? Justify your choice, and explain how the answer depends on what exactly the treatment changes.

What This Part Should Cover

  • Reasons from SUTVA, not from a default ("just randomize users").
  • Identifies both contamination channels — shared locker capacity and host-level behavior change — and which unit internalizes each.
  • Connects the unit choice to the launch mechanism and acknowledges the power vs. validity tradeoff (finer = more powerful but riskier).
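If the unit ends up being a cluster (listing, host, or market), one common way to keep every reservation in a cluster on the same arm is deterministic hash bucketing on the cluster key. The sketch below is illustrative only: the salt string, the 50/50 split, and the function name `assign_variant` are assumptions, not anything specified by the question.

```python
import hashlib

def assign_variant(cluster_key: str, salt: str = "locker-exp-v1",
                   treatment_share: float = 0.5) -> str:
    """Deterministically map a cluster key (e.g., a listing_id) to an arm.

    The salt is a hypothetical experiment identifier; changing it reshuffles
    assignments so that separate experiments are not correlated.
    """
    digest = hashlib.sha256(f"{salt}:{cluster_key}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

# Every reservation at the same listing inherits the listing's arm, so a host
# never runs treatment and control experiences side by side.
reservations = [("r1", "listing-100"), ("r2", "listing-100"), ("r3", "listing-200")]
arms = {res_id: assign_variant(listing_id) for res_id, listing_id in reservations}
```

Hashing on `listing_id` internalizes host-level behavior change. Pooled off-site lockers would call for hashing on a coarser key, such as the market, at the cost of fewer effective units.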

Part 2 — Eligibility, treatment, control

Define the eligible population, the treatment experience, and the control experience precisely enough that an engineer could implement the assignment logic.

What This Part Should Cover

  • Eligibility defined pre-assignment and independent of any post-treatment behavior.
  • Treatment defined as the offer of locker access, not actual usage.
  • Control defined as the realistic status quo (current host-managed workaround), not a strawman.
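To make "precise enough for an engineer" concrete, the assignment logic can be sketched as a pure function of pre-assignment attributes. Everything named here is hypothetical: the fields `market_has_lockers` and `nights`, and the experience labels, stand in for whatever the real eligibility rules turn out to be.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReservationContext:
    # Hypothetical pre-assignment attributes; none depend on guest behavior
    # after the locker offer could have been shown.
    market_has_lockers: bool
    nights: int

def is_eligible(ctx: ReservationContext) -> bool:
    """Eligibility uses only information known before assignment."""
    return ctx.market_has_lockers and ctx.nights >= 1

def experience(ctx: ReservationContext, variant: str) -> str:
    """Treatment is the *offer*; control is the status quo workaround."""
    if not is_eligible(ctx):
        return "not_in_experiment"
    return "locker_offer_shown" if variant == "treatment" else "status_quo"
```

Note that the function never inspects clicks or locker usage, which keeps the eligible population identical across arms.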

Part 3 — Metrics

Choose a primary metric, a set of secondary metrics, and guardrail metrics. Discuss the tradeoffs among guest experience, host experience, operational cost, safety, and marketplace revenue.

What This Part Should Cover

  • A single, well-justified primary tied to the feature's funnel position.
  • A coherent metric tree (primary / secondary / guardrail), not a flat list of everything measurable.
  • Explicit marketplace tradeoffs — names the guardrails that protect hosts, safety, and cost, and states a launch decision rule.
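A launch decision rule can be written down before the test starts. The sketch below assumes one possible rule: hold if any guardrail's worst-case harm exceeds its tolerance, launch only if the primary's confidence interval excludes zero in the good direction. The guardrail names and thresholds in the usage lines are made up for illustration.

```python
def launch_decision(primary_ci, guardrails):
    """Apply a pre-registered decision rule.

    primary_ci: (low, high) CI for the primary effect, oriented so that
                positive means improvement.
    guardrails: {name: (harm_upper_bound, tolerance)}, where harm_upper_bound
                is the pessimistic end of the CI for harm on that metric.
    Returns (decision, breached_guardrails).
    """
    low, _high = primary_ci
    breached = sorted(name for name, (harm_upper, tolerance) in guardrails.items()
                      if harm_upper > tolerance)
    if breached:
        return "hold", breached
    if low > 0:
        return "launch", []
    return "no-launch", []

# Hypothetical numbers, for illustration only.
decision = launch_decision(
    primary_ci=(0.002, 0.010),
    guardrails={"damage_claim_rate": (0.0005, 0.001),
                "locker_over_capacity_rate": (0.03, 0.02)},
)
```

Checking guardrails first encodes the marketplace priority: a guest-side win does not override a host, safety, or capacity breach.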

Part 4 — Sample size and duration

Explain how you would compute the required sample size and decide the experiment duration.

What This Part Should Cover

  • Sample size driven by a pre-registered MDE and the correct two-proportion power formula (for a binary primary).
  • Design-effect inflation when randomizing at a cluster, with the right intuition for $\rho$ and cluster size.
  • Duration reasoning that goes beyond $n/\text{traffic}$: weekly seasonality, the booking-to-checkout maturation lag, and thin-market constraints.
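The power calculation can be sketched with the standard two-proportion formula and the design effect $1 + (\bar{m}-1)\rho$. The baseline rate, MDE, cluster size, and intraclass correlation used below are placeholder values chosen for illustration, not figures from the question.

```python
import math
from statistics import NormalDist

def n_per_arm_two_proportions(p_control, mde_abs, alpha=0.05, power=0.80):
    """Units per arm to detect an absolute change `mde_abs` in a proportion."""
    p_treat = p_control + mde_abs
    p_bar = (p_control + p_treat) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p_control * (1 - p_control)
                                      + p_treat * (1 - p_treat))) ** 2
    return math.ceil(numerator / mde_abs ** 2)

def design_effect(avg_cluster_size, icc):
    """Variance inflation from cluster randomization: 1 + (m_bar - 1) * rho."""
    return 1 + (avg_cluster_size - 1) * icc

# Placeholder inputs: 10% baseline support-contact rate, 1-point absolute drop,
# 5 reservations per listing on average, ICC of 0.05.
n_simple = n_per_arm_two_proportions(p_control=0.10, mde_abs=-0.01)
n_clustered = math.ceil(n_simple * design_effect(avg_cluster_size=5, icc=0.05))
```

Dividing `n_clustered` by eligible traffic gives only a lower bound on duration; the run still needs whole weeks plus the booking-to-checkout lag before outcomes mature.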

Part 5 — Analysis plan

Describe how you would estimate the treatment effect and compute a p-value for the primary metric, including how you'd reduce variance and handle clustered randomization.

What This Part Should Cover

  • States the ITT estimand explicitly and never conditions on post-treatment usage.
  • Matches the test to the metric type and reports a CI / effect size, not just a p-value.
  • Variance reduction (covariate adjustment / CUPED) and cluster-robust inference matched to the actual randomization level.
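A minimal sketch of the analysis for a binary primary follows: an intent-to-treat difference in proportions with a pooled z-test and an unpooled Wald interval, plus a CUPED-style adjustment for a continuous metric. It assumes the unit of analysis equals the unit of randomization; under cluster randomization these standard errors would be too small and cluster-robust inference would be needed instead. Function names are illustrative.

```python
import math
from statistics import NormalDist

def itt_two_proportion(x_t, n_t, x_c, n_c, alpha=0.05):
    """ITT effect on a binary outcome, compared by assignment only."""
    p_t, p_c = x_t / n_t, x_c / n_c
    diff = p_t - p_c
    p_pool = (x_t + x_c) / (n_t + n_c)
    se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
    z = 0.0 if se_pooled == 0 else diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    se_unpooled = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * se_unpooled
    return {"diff": diff, "z": z, "p_value": p_value,
            "ci": (diff - half_width, diff + half_width)}

def cuped_adjust(y, x):
    """Return y - theta * (x - mean(x)), with theta = cov(x, y) / var(x).

    x must be a pre-assignment covariate (e.g., a pre-period rate), and theta
    should be estimated on both arms pooled so the adjustment is unbiased.
    """
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    if ss_x == 0:
        return list(y)
    theta = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / ss_x
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]

# Illustrative counts: 900 of 10,000 treated vs. 1,000 of 10,000 control units
# contacted support.
result = itt_two_proportion(x_t=900, n_t=10_000, x_c=1_000, n_c=10_000)
```

The adjusted values keep the same mean as the raw outcome, so the effect estimate targets the same estimand with lower variance.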

Part 6 — Hands-on A/B and A/A test

You are given a reservation-level experiment table with these columns:

| column | type | meaning |
|---|---|---|
| `unit_id` | string | randomization unit id |
| `guest_id`, `host_id`, `listing_id` | string | identity / clustering keys |
| `market` | string | market / city |
| `assignment_variant` | string | `treatment` or `control` |
| `assigned_at` | timestamp (UTC) | when the unit was assigned |
| `checkin_at`, `checkout_at` | timestamp (UTC) | stay window |
| `booked` | int | 1 if a booking was made |
| `used_locker` | int | 1 if the guest actually used a locker (post-treatment) |
| `support_contacted` | int | 1 if guest contacted support |
| `post_stay_rating` | float | rating after stay |
| `gross_booking_value` | float | GBV in currency units |
| `pre_assignment_trips` | int | prior trips (pre-period covariate) |
| `pre_assignment_support_rate` | float | prior support rate (pre-period covariate) |
| `is_aa_test` | bool | TRUE if the row belongs to an A/A holdout |

Explain, concretely, how you would (a) run the A/B test to estimate the effect and p-value on this table, and (b) run an A/A test to detect bias, sample-ratio mismatch, or instrumentation issues. Reference the specific columns you'd use.

What This Part Should Cover

  • A column-specific, runnable A/B procedure: filter on is_aa_test, validate unit_id uniqueness and that assigned_at precedes outcomes, estimate ITT on the chosen primary.
  • Treats used_locker strictly as post-treatment (a diagnostic, never a filter or conditioning variable for the headline).
  • An A/A that runs the SRM (chi-square) check and covariate balance via standardized mean differences, and explains what a failure implies for trusting the live A/B.
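The checks described above can be sketched with the standard library alone, treating the table as a list of dicts keyed by the column names given in the question. The helper names are illustrative, and the validation compares `assigned_at` with `checkin_at` as one possible reading of "assignment precedes outcomes".

```python
import math
from collections import Counter
from datetime import datetime
from statistics import mean, variance

def split_by_arm(rows, is_aa):
    """Filter on is_aa_test, then group rows by assignment_variant."""
    arms = {"treatment": [], "control": []}
    for r in rows:
        if r["is_aa_test"] == is_aa:
            arms[r["assignment_variant"]].append(r)
    return arms

def validate(rows):
    """Flag duplicate unit_id values and assignments logged after check-in."""
    counts = Counter(r["unit_id"] for r in rows)
    problems = [f"duplicate unit_id {u}" for u, c in counts.items() if c > 1]
    problems += [f"{r['unit_id']}: assigned_at after checkin_at"
                 for r in rows if r["assigned_at"] > r["checkin_at"]]
    return problems

def srm_p_value(n_treatment, n_control, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit (1 degree of freedom) on variant counts."""
    total = n_treatment + n_control
    exp_t = total * expected_treatment_share
    exp_c = total - exp_t
    chi2 = (n_treatment - exp_t) ** 2 / exp_t + (n_control - exp_c) ** 2 / exp_c
    return math.erfc(math.sqrt(chi2 / 2))  # survival function for 1 dof

def standardized_mean_difference(treated, control):
    """Balance check for a pre-period covariate (needs >= 2 values per arm)."""
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2)
    return 0.0 if pooled_sd == 0 else (mean(treated) - mean(control)) / pooled_sd

# A 51/49 split is unremarkable at n = 1,000 but alarming at n = 100,000.
p_small = srm_p_value(510, 490)
p_large = srm_p_value(51_000, 49_000)
```

On the A/B rows (`is_aa_test` false), the counts of `support_contacted` per arm feed the ITT test; on the A/A rows, `srm_p_value` and `standardized_mean_difference` on `pre_assignment_trips` and `pre_assignment_support_rate` should both look unremarkable. `used_locker` appears in neither path.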

Part 7 — Risks

Discuss the risks that threaten this specific experiment: selection bias, interference between treatment and control, locker capacity constraints, host-side spillovers, and heterogeneous effects across markets. For each, name how your design or analysis mitigates it.

What This Part Should Cover

  • A risk-to-mitigation mapping (each risk paired with the specific design or analysis lever), not a generic risk list.
  • Recognizes that the mitigations are the same levers chosen earlier (unit choice, ITT, stratification), showing the design hangs together.
  • Calls out heterogeneity discipline — pre-specified stratification, not post-hoc subgroup fishing.
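Pre-specified stratification can be made mechanical: fix the list of markets before unblinding, estimate the effect in each, and correct for the number of comparisons. The sketch below uses a Bonferroni correction as one simple, conservative choice; the market names and counts are invented for illustration.

```python
import math
from statistics import NormalDist

def per_market_effects(counts, alpha=0.05):
    """Per-market ITT difference in proportions with Bonferroni correction.

    counts: {market: (x_treat, n_treat, x_control, n_control)} for a list of
            markets fixed before looking at outcomes.
    """
    per_test_alpha = alpha / len(counts)
    results = {}
    for market, (x_t, n_t, x_c, n_c) in counts.items():
        diff = x_t / n_t - x_c / n_c
        p_pool = (x_t + x_c) / (n_t + n_c)
        se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
        p_value = 1.0 if se == 0 else 2 * (1 - NormalDist().cdf(abs(diff / se)))
        results[market] = {"diff": diff, "p_value": p_value,
                           "significant": p_value < per_test_alpha}
    return results

# Invented counts of support contacts per arm in two pre-specified strata.
effects = per_market_effects({
    "urban": (800, 10_000, 1_000, 10_000),
    "rural": (100, 1_000, 100, 1_000),
})
```

Markets added after seeing the data would need to be reported as exploratory, since the correction only covers the comparisons fixed in advance.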

What a Strong Answer Covers

These dimensions span all parts and tie the case together:

  • Separates the three problems: the product decision, the randomization design, and the statistical analysis, and doesn't conflate "guests like it" with "the marketplace is better off."
  • Treats the two-sided marketplace as a first-class constraint throughout (guest wins are gated by host, safety, cost, and capacity guardrails).
  • Consistency across parts: the randomization unit, estimand, metrics, power math, and risk mitigations all reference each other rather than contradicting.
  • Pre-registration discipline: MDE, primary metric, and any subgroup analyses are fixed before looking at data.

Follow-up Questions

  • The treatment effect on the primary metric is positive and significant, but locker over-capacity rate breaches its guardrail in two high-volume markets. What do you recommend, and what's the next experiment?
  • The A/A test shows a 2% sample-ratio mismatch (e.g., 51/49 instead of 50/50) that is statistically significant. Walk through how you'd diagnose whether it's a logging artifact or a real assignment bug, and whether you'd trust the A/B result.
  • How would you estimate the treatment-on-treated (TOT) effect for guests who actually used a locker, and what assumption does using assignment as an instrument require?
  • Suppose the effect is strongly heterogeneous — large in dense urban markets, negative in rural ones. How does that change the launch decision and the rollout plan?
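For the treatment-on-treated follow-up, one standard approach is the Wald (instrumental-variable) estimator: divide the ITT effect on the outcome by the difference in locker usage between arms. The sketch assumes random assignment, that assignment affects outcomes only through locker use (exclusion restriction), and that nobody uses a locker less because they were offered one (monotonicity). When control guests cannot access lockers at all, the result coincides with the effect on the treated. The input rates are illustrative.

```python
def wald_estimate(y_treat_mean, y_control_mean, use_rate_treat, use_rate_control=0.0):
    """ITT effect on the outcome divided by the ITT effect on locker usage."""
    first_stage = use_rate_treat - use_rate_control
    if abs(first_stage) < 1e-12:
        raise ValueError("assignment did not change usage; the ratio is undefined")
    return (y_treat_mean - y_control_mean) / first_stage

# Illustrative: a 1-point ITT drop in support contacts with 25% locker uptake
# implies roughly a 4-point drop among guests who used a locker.
tot = wald_estimate(y_treat_mean=0.09, y_control_mean=0.10, use_rate_treat=0.25)
```

Comparing users with non-users directly would reintroduce the self-selection that the ITT estimate was designed to avoid.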
