PracHub

Design an A/B test with causal inference

Last updated: Apr 21, 2026

Quick Overview

This question evaluates skills in experimental design, causal inference, and applied statistics — including estimand selection, sample-size calculation under clustering, integrity monitoring, handling noncompliance and contamination, sequential monitoring, and two-proportion inference — within the Analytics & Experimentation domain for a Data Scientist role. It is commonly asked because interviewers need to assess the ability to design and analyze robust A/B tests under real-world constraints; the prompt requires both conceptual understanding of causal assumptions and practical application of power calculations, diagnostics, and monitoring procedures.


Company: Airbnb

Role: Data Scientist

Category: Analytics & Experimentation

Difficulty: hard

Interview Round: Technical Screen

You own experimentation for an e-commerce checkout nudge. Design an A/B test randomized at the guest_id level and run for 28 days (2025-08-04 to 2025-08-31).

Primary metric: completed order within 7 days of first exposure. Guardrails: bounce rate and p95 page latency. Baseline 7-day per-guest conversion is 5%; minimum detectable relative lift is 8%; two-sided α=0.05; power=0.80. Average 1.6 sessions per guest with ICC=0.05.

Constraints: repeat visitors across devices, 5% bot traffic, some cookie resets causing cross-arm contamination.

Answer:

  1. Define the estimand (ITT vs TOT) and justify the unit (guest vs session) and exposure definition with cross-device deduping and noncompliance.
  2. Compute required per-arm sample size accounting for clustering (show the design effect and final n per arm).
  3. Specify SRM and integrity checks (e.g., device/geo imbalance, traffic-source mix), how to detect, and how to remediate.
  4. If randomization fails and you only have pre/post windows (pre: 2025-07-01 to 2025-07-31; post: 2025-09-01 to 2025-09-30), formulate a credible causal strategy (e.g., DiD with covariates/CUPED or PSM/IPW): state the identifying assumptions, write the ATE estimator, and describe how you'd test parallel trends and overlap.
  5. Address interference/novelty and propose a sequential monitoring plan that controls Type I error (e.g., O'Brien–Fleming boundaries) and a plan for early stopping for harm.
  6. Suppose the experiment ends with control conv=5.0% (n=120,000) and treatment conv=5.6% (n=120,000). Compute the lift, its standard error/95% CI (properly accounting for two-proportion comparison), and interpret both statistical and practical significance; would you ship under these constraints?


Oct 13, 2025, 9:49 PM

A/B Test Design: Checkout Nudge (Guest-Level Randomization)

You own experimentation for an e-commerce checkout flow. You're launching a checkout nudge and need to design, run, and read out an A/B test — including what to do if randomization breaks down.

This is an end-to-end experimentation case: you'll define the estimand, size the test under realistic traffic conditions, defend its integrity, fall back to an observational causal design if randomization fails, monitor it without inflating error rates, and make a final ship/no-ship call from the readout numbers.

Setup

Timeline

  • Run window: 2025-08-04 to 2025-08-31 (28 days).
  • Maturation: the primary metric needs a 7-day conversion window, so analyze a matured panel. Either restrict to first exposures through 2025-08-24 (so every guest has a full 7-day follow-up window before the run ends), or keep all first exposures and allow a 7-day measurement lag out to 2025-09-07.

Experiment design

  • Randomization unit: guest_id, sticky across sessions for the duration of the test.
  • Primary metric: completed an order within 7 days of first exposure to the checkout nudge.
  • Guardrails: bounce rate; p95 page latency.

Statistical parameters

  • Baseline: 7-day per-guest conversion = 5%.
  • Minimum detectable effect (MDE): 8% relative lift over baseline.
  • Significance: two-sided α = 0.05.
  • Power: 0.80.
  • Clustering inputs: average 1.6 sessions per guest; intra-class correlation (ICC) across sessions within a guest = 0.05.
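
As a rough check on the sizing inputs above, here is a minimal Python sketch. It assumes the standard unpooled two-proportion normal approximation and the usual design effect 1 + (m − 1) × ICC. The function names are illustrative, not from the prompt, and no buffer for bots or contamination is included.

```python
from math import ceil
from statistics import NormalDist


def per_arm_sample_size(p_base, rel_lift, alpha=0.05, power=0.80):
    """Guests per arm under an unpooled two-proportion normal approximation."""
    p_treat = p_base * (1 + rel_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    return (z_alpha + z_beta) ** 2 * variance_sum / (p_treat - p_base) ** 2


def design_effect(mean_cluster_size, icc):
    """Variance inflation from clustering: 1 + (m - 1) * ICC."""
    return 1 + (mean_cluster_size - 1) * icc


n_base = per_arm_sample_size(p_base=0.05, rel_lift=0.08)
deff = design_effect(mean_cluster_size=1.6, icc=0.05)
n_final = ceil(n_base * deff)

print(f"base n per arm: {n_base:,.0f}")
print(f"design effect:  {deff:.3f}")
print(f"final n per arm: {n_final:,}")
```

With the stated inputs this comes to roughly 48.4k guests per arm before the design effect and roughly 49.8k after (design effect 1.03). If the outcome is a single binary per guest and the analysis is at the guest level, you could argue the clustering adjustment is already absorbed; the sketch applies it because the task asks for it.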

Traffic realities (constraints)

  • Repeat visitors across multiple devices.
  • ~5% bot traffic.
  • Some cookie resets, causing cross-arm contamination.
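
Bots and cookie resets can drop or re-bucket guests unevenly between arms, which would show up as a sample ratio mismatch (SRM). A minimal sketch of an SRM check follows, assuming a 50/50 target split and a chi-square goodness-of-fit test with 1 degree of freedom. The counts and the 0.001 alert threshold are hypothetical illustrations, not values from the prompt.

```python
from math import erfc, sqrt


def srm_check(n_control, n_treatment, expected_treatment_share=0.5, threshold=0.001):
    """Chi-square goodness-of-fit test of observed arm counts vs the planned split."""
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value, p_value < threshold


# Hypothetical counts, for illustration only.
for control, treatment in [(50_000, 50_600), (50_000, 51_200)]:
    chi2, p_value, flagged = srm_check(control, treatment)
    print(f"{control:,} vs {treatment:,}: chi2={chi2:.2f}, p={p_value:.4g}, SRM flagged={flagged}")
```

A strict threshold is used here because the check is typically rerun many times; when it fires, the usual next step is to slice by device, geo, and traffic source to locate the defect before reading any effect.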

Clarifying Questions to Ask

Before designing, confirm scope with the interviewer. Strong candidates surface assumptions rather than guessing:

  • Allocation & ramp: Is this a clean 50/50 split, or do we ramp from a small treatment fraction first? Is there an existing holdback we must respect?
  • Eligibility & exposure: Does "exposure" mean assigned to the nudge, or rendered the nudge? Which surfaces/pages count as checkout-eligible, and are logged-out guests in scope?
  • Identity resolution: What deterministic keys do we have (logged-in user_id, hashed email/payment token) to link a guest across devices and cookie resets, and how reliable are they?
  • Decision criteria: What lift is worth shipping, and what guardrail movement is disqualifying (e.g., max tolerable bounce increase or p95 latency budget)?
  • Operational constraints: Are there concurrent experiments or campaigns that could change traffic mix or interact with this nudge? Is there seasonality in the run window?
  • Data trust: How are bots currently filtered, and at what stage (ingestion vs analysis)?

What a Strong Answer Covers

The interviewer is checking for these signals across the six parts (these are dimensions, not the answers):

  • Estimand discipline: picks a primary estimand and justifies it against the business decision; treats noncompliance/leakage explicitly rather than ignoring it.
  • Correct unit of analysis: reasons about why the randomization unit and the analysis unit must align, and what clustering does to standard errors.
  • Power arithmetic that accounts for reality: a defensible base sample size, an explicit design-effect adjustment, and buffers for contamination/bots — not a single textbook number.
  • Integrity-first instinct: checks the split before trusting any effect; distinguishes a randomization defect from a real treatment effect.
  • Credible causal fallback: names identifying assumptions, writes an estimator, and proposes falsification tests — not just "use DiD."
  • Inference hygiene under peeking: controls Type I error across interim looks and separates stop-for-benefit from stop-for-harm.
  • Numerate, decisive readout: computes effect, SE, CI, and a test statistic correctly, then ties the ship decision to both statistical and practical significance and the guardrails.
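
For the peeking dimension above, one way to make "controls Type I error across interim looks" concrete is an alpha-spending schedule. The sketch below uses the Lan-DeMets O'Brien–Fleming-type spending function, α(t) = 2(1 − Φ(z_{α/2} / √t)). It assumes four equally spaced looks and treats elapsed weeks as the information fraction, which is a simplification, and it reports only cumulative alpha spent. Exact z boundaries need numerical integration in group-sequential software, and a stop-for-harm rule would be specified separately.

```python
from statistics import NormalDist


def obf_alpha_spent(info_fraction, alpha=0.05):
    """Cumulative two-sided alpha spent at information fraction t (OBF-type spending)."""
    if not 0 < info_fraction <= 1:
        raise ValueError("info_fraction must be in (0, 1]")
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / info_fraction ** 0.5))


# Hypothetical schedule: one look per week of the 28-day run.
looks = [0.25, 0.50, 0.75, 1.00]
cumulative = [obf_alpha_spent(t) for t in looks]
incremental = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]

for t, cum, inc in zip(looks, cumulative, incremental):
    print(f"t={t:.2f}: cumulative alpha={cum:.5f}, spent at this look={inc:.5f}")
```

The schedule spends almost nothing early (about 0.0001 at the first look, about 0.006 by the halfway point) and reaches the full 0.05 only at the final look, which is why this family is a common choice when early stopping for benefit should require overwhelming evidence.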

Tasks

  1. Estimand & unit. Define the estimand (ITT vs TOT), justify guest vs session as the analysis unit, and define exposure, including cross-device deduping and how you handle noncompliance.
  2. Sample size. Compute the required per-arm sample size accounting for clustering. Show the design effect and the final required n per arm.
  3. Integrity checks. Specify SRM and integrity checks (e.g., device/geo imbalance, traffic-source mix): how you'd detect them and how you'd remediate issues.
  4. Causal fallback. Suppose randomization fails and you only have pre/post windows (pre: 2025-07-01 to 2025-07-31; post: 2025-09-01 to 2025-09-30). Propose a credible causal strategy (e.g., DiD with covariates/CUPED, or PSM/IPW). State the identifying assumptions, write the ATE estimator, and describe tests for parallel trends and overlap.
  5. Interference, novelty & monitoring. Address interference and novelty effects, and propose a sequential monitoring plan that controls Type I error (e.g., O'Brien–Fleming boundaries), plus a plan for early stopping for harm.
  6. Readout & decision. Suppose the experiment ends with control conversion = 5.0% (n = 120,000) and treatment conversion = 5.6% (n = 120,000). Compute the lift, its standard error and 95% CI for a two-proportion comparison; interpret statistical and practical significance; and state whether you would ship.
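
The arithmetic in the readout task can be sketched as follows. This is a minimal version that assumes independent guests: an unpooled standard error for the confidence interval, a pooled standard error for the z-test, and no clustering or contamination adjustment. The function name and the returned keys are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_readout(p_c, n_c, p_t, n_t, alpha=0.05):
    """Absolute and relative lift, Wald CI (unpooled SE), and pooled z-test."""
    diff = p_t - p_c
    se_unpooled = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)

    # Pooled SE is the conventional choice for testing H0: p_t == p_c.
    p_pool = (p_c * n_c + p_t * n_t) / (n_c + n_t)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    return {"abs_lift": diff, "rel_lift": diff / p_c, "se": se_unpooled,
            "ci": ci, "z": z, "p_value": p_value}


readout = two_proportion_readout(p_c=0.050, n_c=120_000, p_t=0.056, n_t=120_000)
for key, value in readout.items():
    print(key, value)
```

On the stated numbers this gives an absolute lift of about 0.60 percentage points (12% relative), a standard error of about 0.091 points, a 95% CI of roughly 0.42 to 0.78 points, and z of about 6.6. Dividing the interval by the 5.0% baseline puts the relative lift at roughly 8.4% to 15.6%, which is a quick approximation that ignores uncertainty in the baseline. The ship decision still depends on the guardrails and the integrity checks, which this calculation does not cover.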

Follow-up Questions

Be ready to go deeper after the main answer:

  • Heterogeneity: Treatment is positive overall but you suspect harm on low-end devices where the nudge adds latency. How would you detect a harmful subgroup without p-hacking across many slices?
  • Scale & traffic shortfall: If forecasted traffic only delivers ~60% of the required per-arm n in the 28-day window, what are your options, and how does each affect power, MDE, or run time?
  • Contamination worsens: If cookie resets push cross-arm contamination to 5%, what does that do to your observed ITT and your required sample size — and would you still trust the readout?
  • TOT divergence: If the nudge renders for only 70% of assigned-treatment guests, how do ITT and TOT diverge, and which one drives the ship decision versus the per-user efficacy story?
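
The scaling behind the shortfall, contamination, and TOT follow-ups can be sketched with a few one-line relationships. Every model here is a simplifying assumption rather than something the prompt specifies. MDE is taken as proportional to 1/√n. Contamination is modeled as a symmetric swap, where a share c of each arm receives the other arm's experience. TOT uses the Wald (instrumental-variable) ratio under one-sided noncompliance.

```python
from math import sqrt


def mde_inflation(sample_fraction):
    """Approximate growth in detectable effect when only a fraction of planned n arrives."""
    return 1 / sqrt(sample_fraction)


def diluted_itt(true_effect, contamination_rate):
    """Observed ITT if a share c of each arm gets the opposite experience (assumed model)."""
    return true_effect * (1 - 2 * contamination_rate)


def sample_size_inflation(contamination_rate):
    """Extra n needed to keep power when the effect shrinks by (1 - 2c)."""
    return 1 / (1 - 2 * contamination_rate) ** 2


def tot_from_itt(itt, render_rate):
    """Wald estimate of the effect on guests who actually saw the nudge."""
    return itt / render_rate


design_mde_abs = 0.05 * 0.08  # 8% relative lift on a 5% baseline = 0.4 points
print("MDE inflation at 60% of n:", round(mde_inflation(0.60), 3))
print("Observed ITT at 5% contamination:", diluted_itt(design_mde_abs, 0.05))
print("Sample-size inflation at 5% contamination:", round(sample_size_inflation(0.05), 3))
print("TOT if 70% render:", round(tot_from_itt(design_mde_abs, 0.70), 5))
```

Under these assumptions, 60% of the planned sample inflates the detectable effect by about 1.29x, 5% symmetric contamination shrinks the observed ITT to 90% of the true effect and raises the required sample by about 23%, and a 70% render rate makes the TOT about 1.43x the ITT. The TOT ratio also relies on assignment affecting conversion only through rendering the nudge.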

Constraints & Assumptions

Anchor your reasoning to these (don't invent additional numbers):

  • Equal allocation (50/50) unless you explicitly argue for a ramp; assignment is sticky per guest_id.
  • "First exposure" anchors the 7-day conversion window; only checkout-eligible guests enter the analysis population.
  • Bots (~5%) and cookie resets are known data-quality risks; state where you'd filter or buffer for them rather than assuming clean data.
  • For the causal fallback, assume a plausibly-comparable never-exposed control group exists (e.g., a retained holdback or unaffected surface/geo) and that the metric definition is stable across pre/post windows.
  • Use the readout numbers exactly as given ($p_c = 5.0\%$, $p_t = 5.6\%$, $n = 120{,}000$ per arm).
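
For the causal-fallback assumption above, one common way to write the difference-in-differences estimator is sketched below. The notation is illustrative and not from the prompt: $T$ is the exposed group, $C$ the never-exposed comparison group, $D_i$ an exposed-group indicator, $\mathrm{Post}_t$ a post-window indicator, and $X_i$ pre-period covariates.

```latex
\hat{\tau}_{\mathrm{DiD}}
  = \left(\bar{Y}_{T,\mathrm{post}} - \bar{Y}_{T,\mathrm{pre}}\right)
  - \left(\bar{Y}_{C,\mathrm{post}} - \bar{Y}_{C,\mathrm{pre}}\right)

Y_{it} = \alpha + \beta D_i + \gamma\,\mathrm{Post}_t
       + \tau\,(D_i \times \mathrm{Post}_t) + X_i^{\top}\delta + \varepsilon_{it}
```

In the regression form, $\tau$ is the DiD coefficient. Under parallel trends and no anticipation it identifies the average effect on the exposed group (ATT); reading it as the ATE needs an extra assumption, such as homogeneous effects across groups.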