What difficulty level is this interview question?

This is a medium difficulty Analytics & Experimentation question, commonly asked during Technical Screen rounds at Airbnb.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Airbnb during technical interviews.

Design and Analyze Airbnb Locker Experiment

This question evaluates a candidate's skills in experimental design and causal inference for marketplace features, including selecting the appropriate randomization unit, defining eligibility and treatment/control arms, and choosing primary, secondary, and guardrail metrics.

Airbnb is considering launching a luggage locker feature that lets guests store their bags before their scheduled check-in time, so they don't have to wait until the room is ready. You are the data scientist asked to design — and then analyze — an experiment that decides whether Airbnb should launch this feature.

The interview has two halves: a case-study design discussion (Parts 1–5, 7) and a hands-on analysis of a provided experiment dataset (Part 6).

Constraints & Assumptions

Airbnb is a two-sided marketplace: an intervention can affect guests, hosts, and the platform's revenue simultaneously, and these interests can conflict.
Lockers are (at least partly) shared physical infrastructure with finite capacity in a given location.
Outcomes span the full booking lifecycle: assignment may happen at booking time, but key outcomes (post-stay rating, support contacts, damage claims) only mature after checkout.
Assume standard significance and power targets unless you argue otherwise: $\alpha = 0.05$ (two-sided) and $80\%$ power.
You have access to historical baselines (e.g., support-contact rate, booking conversion) to power the test.

Clarifying Questions to Ask

Is the locker offered before booking (a search/booking feature) or after booking (a check-in convenience)? This determines the primary metric and randomization point.
Are lockers on-listing (host-operated, e.g., a smart lock box) or shared off-site infrastructure with pooled capacity? This determines whether SUTVA can hold at the reservation level.
What is the intended launch mechanism — a UX toggle, a host opt-in, or a market-wide rollout? The experiment should mirror how it will actually ship.
Is the feature free or priced? Pricing changes both the metric set and the revenue tradeoff.
What baseline rates do we have for the candidate primary metric, and what effect size would be worth launching?
Are there safety, liability, or insurance constraints (lost/damaged luggage) that gate launch regardless of the metric outcome?

Part 1 — Randomization unit

Should the experiment be randomized on the guest side, reservation side, host/listing side, or market side? Justify your choice, and explain how the answer depends on what exactly the treatment changes.

What This Part Should Cover

Reasons from SUTVA, not from a default ("just randomize users").
Identifies both contamination channels — shared locker capacity and host-level behavior change — and which unit internalizes each.
Connects the unit choice to the launch mechanism and acknowledges the power vs. validity tradeoff (finer = more powerful but riskier).

Part 2 — Eligibility, treatment, control

Define the eligible population, the treatment experience, and the control experience precisely enough that an engineer could implement the assignment logic.

What This Part Should Cover

Eligibility defined pre-assignment and independent of any post-treatment behavior.
Treatment defined as the offer of locker access, not actual usage.
Control defined as the realistic status quo (current host-managed workaround), not a strawman.
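
As one way to make the assignment logic concrete, here is a minimal sketch of pre-assignment eligibility plus deterministic hash bucketing. Everything named here is an assumption for illustration: the `Reservation` fields, the `LOCKER_MARKETS` set, the 24-hour lead-time rule, and the `SALT` string are hypothetical, and `unit_id` stands for whichever randomization unit Part 1 settles on (reservation, listing, or market).

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta

SALT = "locker_exp_v1"               # hypothetical experiment salt
LOCKER_MARKETS = {"paris", "tokyo"}  # hypothetical markets with lockers installed


@dataclass(frozen=True)
class Reservation:
    unit_id: str          # id of the chosen randomization unit
    market: str
    booked_at: datetime   # hypothetical field: when the booking was made
    checkin_at: datetime


def is_eligible(r: Reservation) -> bool:
    """Use only facts known before assignment, never post-treatment behavior."""
    has_lead_time = r.checkin_at - r.booked_at >= timedelta(hours=24)
    return r.market in LOCKER_MARKETS and has_lead_time


def assign(unit_id: str, salt: str = SALT, treat_share: float = 0.5) -> str:
    """Hash the unit id with the salt so the same unit always gets the same arm."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform in [0, 1)
    return "treatment" if bucket < treat_share else "control"
```

Treatment here means the unit is offered locker access; whether the guest uses it is recorded afterward and plays no role in `is_eligible` or `assign`.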

Part 3 — Metrics

Choose a primary metric, a set of secondary metrics, and guardrail metrics. Discuss the tradeoffs among guest experience, host experience, operational cost, safety, and marketplace revenue.

What This Part Should Cover

A single, well-justified primary tied to the feature's funnel position.
A coherent metric tree (primary / secondary / guardrail), not a flat list of everything measurable.
Explicit marketplace tradeoffs — names the guardrails that protect hosts, safety, and cost, and states a launch decision rule.

Part 4 — Sample size and duration

Explain how you would compute the required sample size and decide the experiment duration.

What This Part Should Cover

Sample size driven by a pre-registered MDE and the correct two-proportion power formula (for a binary primary).
Design-effect inflation when randomizing at a cluster, with the right intuition for $\rho$ and cluster size.
Duration reasoning that goes beyond $n/\text{traffic}$: weekly seasonality, the booking-to-checkout maturation lag, and thin-market constraints.
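
The power arithmetic above can be sketched as follows, using the standard two-proportion sample-size formula, the design effect $1 + (m - 1)\rho$ for clusters of average size $m$, and a duration rounded up to whole weeks plus a maturation lag. The baseline rate, MDE, cluster size, $\rho$, traffic, and lag in the usage lines are made-up illustrative inputs, not figures from the case.

```python
import math
from statistics import NormalDist


def n_per_arm(p_control, mde_abs, alpha=0.05, power=0.80):
    """Units per arm for a two-sided two-proportion z-test."""
    p_treat = p_control + mde_abs
    p_bar = (p_control + p_treat) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    null_sd = math.sqrt(2 * p_bar * (1 - p_bar))
    alt_sd = math.sqrt(p_control * (1 - p_control) + p_treat * (1 - p_treat))
    return math.ceil((z_alpha * null_sd + z_beta * alt_sd) ** 2 / mde_abs ** 2)


def design_effect(avg_cluster_size, icc):
    """Variance inflation from randomizing clusters: 1 + (m - 1) * rho."""
    return 1 + (avg_cluster_size - 1) * icc


def duration_days(n_total, units_per_week, maturation_days):
    """Whole weeks of enrollment (covers weekly seasonality) plus outcome lag."""
    weeks = math.ceil(n_total / units_per_week)
    return weeks * 7 + maturation_days


# Illustrative inputs only: 10% baseline support-contact rate, 1 point reduction.
n = n_per_arm(p_control=0.10, mde_abs=-0.01)
n_clustered = math.ceil(n * design_effect(avg_cluster_size=20, icc=0.05))
print(n, n_clustered, duration_days(2 * n_clustered, 10_000, maturation_days=14))
```

Even a small $\rho$ matters when clusters are large, because the inflation scales with $m - 1$; that is the intuition the design-effect bullet is asking for.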

Part 5 — Analysis plan

Describe how you would estimate the treatment effect and compute a p-value for the primary metric, including how you'd reduce variance and handle clustered randomization.

What This Part Should Cover

States the ITT estimand explicitly and never conditions on post-treatment usage.
Matches the test to the metric type and reports a CI / effect size, not just a p-value.
Variance reduction (covariate adjustment / CUPED) and cluster-robust inference matched to the actual randomization level.
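
A minimal sketch of such an analysis plan, under stated assumptions: CUPED with a single pre-period covariate, then an ITT difference in means with a Welch standard error and a normal approximation. For clustered randomization, the sketch collapses rows to one mean per cluster before testing; that is a simpler stand-in for cluster-robust standard errors, not the same estimator. All function names are hypothetical.

```python
import math
from statistics import NormalDist, fmean, variance


def cuped_adjust(y, x):
    """Return y - theta * (x - mean(x)), with theta = cov(y, x) / var(x).

    Estimate theta on both arms pooled, and use only pre-assignment covariates.
    """
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    theta = cov / variance(x)
    return [b - theta * (a - mx) for a, b in zip(x, y)]


def cluster_means(values, cluster_ids):
    """Collapse rows to one mean per cluster (the unit that was randomized)."""
    sums, counts = {}, {}
    for v, c in zip(values, cluster_ids):
        sums[c] = sums.get(c, 0.0) + v
        counts[c] = counts.get(c, 0) + 1
    return [sums[c] / counts[c] for c in sums]


def diff_in_means(treat, control, alpha=0.05):
    """ITT estimate, confidence interval, and two-sided p-value."""
    diff = fmean(treat) - fmean(control)
    se = math.sqrt(variance(treat) / len(treat) + variance(control) / len(control))
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * se
    return diff, (diff - half_width, diff + half_width), p_value
```

The arms are split by assignment only. With few clusters the normal approximation is optimistic, so a t reference distribution or a permutation test would be the safer choice.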

Part 6 — Hands-on A/B and A/A test

You are given a reservation-level experiment table with these columns:

column	type	meaning
`unit_id`	string	randomization unit id
`guest_id`, `host_id`, `listing_id`	string	identity / clustering keys
`market`	string	market / city
`assignment_variant`	string	`treatment` or `control`
`assigned_at`	timestamp (UTC)	when the unit was assigned
`checkin_at`, `checkout_at`	timestamp (UTC)	stay window
`booked`	int	1 if a booking was made
`used_locker`	int	1 if the guest actually used a locker (post-treatment)
`support_contacted`	int	1 if guest contacted support
`post_stay_rating`	float	rating after stay
`gross_booking_value`	float	GBV in currency units
`pre_assignment_trips`	int	prior trips (pre-period covariate)
`pre_assignment_support_rate`	float	prior support rate (pre-period covariate)
`is_aa_test`	bool	TRUE if the row belongs to an A/A holdout

Explain, concretely, how you would (a) run the A/B test to estimate the effect and p-value on this table, and (b) run an A/A test to detect bias, sample-ratio mismatch, or instrumentation issues. Reference the specific columns you'd use.

What This Part Should Cover

A column-specific, runnable A/B procedure: filter on `is_aa_test`, validate `unit_id` uniqueness and that `assigned_at` precedes outcomes, estimate ITT on the chosen primary.
Treats `used_locker` strictly as post-treatment (a diagnostic, never a filter or conditioning variable for the headline).
An A/A that runs the SRM (chi-square) check and covariate balance via standardized mean differences, and explains what a failure implies for trusting the live A/B.
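
The checks listed above can be sketched in plain Python over rows represented as dicts keyed by the table's column names. This sketch assumes `support_contacted` is the primary metric, that the table is loaded as a list of dicts, and that `checkin_at` is an acceptable proxy for "outcomes" when checking timestamps; none of these choices comes from the prompt.

```python
import math
from statistics import NormalDist, fmean, variance


def itt_two_proportion(rows, outcome="support_contacted"):
    """ITT difference in rates and two-sided p-value on the live A/B rows."""
    ab = [r for r in rows if not r["is_aa_test"]]
    ids = [r["unit_id"] for r in ab]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate unit_id in A/B rows")
    if any(r["assigned_at"] > r["checkin_at"] for r in ab):
        raise ValueError("assignment logged after check-in")
    # Split by assignment only; used_locker is never used as a filter here.
    t = [r[outcome] for r in ab if r["assignment_variant"] == "treatment"]
    c = [r[outcome] for r in ab if r["assignment_variant"] == "control"]
    pooled = (sum(t) + sum(c)) / (len(t) + len(c))
    se = math.sqrt(pooled * (1 - pooled) * (1 / len(t) + 1 / len(c)))
    diff = fmean(t) - fmean(c)
    return diff, 2 * (1 - NormalDist().cdf(abs(diff / se)))


def srm_p_value(n_treat, n_control, expected_treat_share=0.5):
    """Chi-square goodness-of-fit test (1 degree of freedom) on arm counts."""
    n = n_treat + n_control
    exp_t, exp_c = n * expected_treat_share, n * (1 - expected_treat_share)
    chi2 = (n_treat - exp_t) ** 2 / exp_t + (n_control - exp_c) ** 2 / exp_c
    return math.erfc(math.sqrt(chi2 / 2))  # survival function of chi-square, 1 df


def smd(treat, control):
    """Standardized mean difference of a pre-period covariate between arms."""
    pooled_sd = math.sqrt((variance(treat) + variance(control)) / 2)
    return (fmean(treat) - fmean(control)) / pooled_sd
```

For the A/A pass, the same functions would be run on the rows where `is_aa_test` is true: `srm_p_value` on the arm counts, `smd` on `pre_assignment_trips` and `pre_assignment_support_rate`, and a two-proportion test on the outcome, which should show no effect.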

Part 7 — Risks

Discuss the risks that threaten this specific experiment: selection bias, interference between treatment and control, locker capacity constraints, host-side spillovers, and heterogeneous effects across markets. For each, name how your design or analysis mitigates it.

What This Part Should Cover

A risk-to-mitigation mapping (each risk paired with the specific design or analysis lever), not a generic risk list.
Recognizes that the mitigations are the same levers chosen earlier (unit choice, ITT, stratification), showing the design hangs together.
Calls out heterogeneity discipline — pre-specified stratification, not post-hoc subgroup fishing.

What a Strong Answer Covers

These dimensions span all parts and tie the case together:

Separates the three problems: the product decision, the randomization design, and the statistical analysis, and doesn't conflate "guests like it" with "the marketplace is better off."
Treats the two-sided marketplace as a first-class constraint throughout (guest wins are gated by host, safety, cost, and capacity guardrails).
Consistency across parts: the randomization unit, estimand, metrics, power math, and risk mitigations all reference each other rather than contradicting.
Pre-registration discipline: MDE, primary metric, and any subgroup analyses are fixed before looking at data.

Follow-up Questions

The treatment effect on the primary metric is positive and significant, but locker over-capacity rate breaches its guardrail in two high-volume markets. What do you recommend, and what's the next experiment?
The A/A test shows a 2% sample-ratio mismatch (e.g., 51/49 instead of 50/50) that is statistically significant. Walk through how you'd diagnose whether it's a logging artifact or a real assignment bug, and whether you'd trust the A/B result.
How would you estimate the treatment-on-treated (TOT) effect for guests who actually used a locker, and what assumption does using assignment as an instrument require?
Suppose the effect is strongly heterogeneous — large in dense urban markets, negative in rural ones. How does that change the launch decision and the rollout plan?
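
For the TOT follow-up, one common approach is the Wald (instrumental-variable) ratio: the ITT effect on the outcome divided by the difference in locker uptake between arms. The sketch below assumes rows are dicts keyed by the Part 6 column names and that `support_contacted` is the outcome of interest; the function name is hypothetical. Reading the ratio causally relies on the usual IV assumptions: assignment is random, it changes uptake, it affects the outcome only through locker use (exclusion restriction), and no one is pushed away from lockers by being offered one (monotonicity).

```python
from statistics import fmean


def wald_tot(rows, outcome="support_contacted"):
    """Wald estimator: ITT on the outcome divided by ITT on locker uptake."""
    t = [r for r in rows if r["assignment_variant"] == "treatment"]
    c = [r for r in rows if r["assignment_variant"] == "control"]
    itt_outcome = fmean([r[outcome] for r in t]) - fmean([r[outcome] for r in c])
    itt_uptake = fmean([r["used_locker"] for r in t]) - fmean([r["used_locker"] for r in c])
    if itt_uptake == 0:
        raise ValueError("assignment did not change locker uptake")
    return itt_outcome / itt_uptake
```

This differs from comparing locker users with non-users directly, which would condition on post-treatment behavior and reintroduce selection bias.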

Quick Overview