You own experimentation for an e-commerce checkout nudge. Design an A/B test randomized at the guest_id level and run for 28 days (2025-08-04 to 2025-08-31).

Primary metric: completed order within 7 days of first exposure. Guardrails: bounce rate and p95 page latency. Baseline 7-day per-guest conversion is 5%; minimum detectable relative lift is 8%; two-sided α=0.05; power=0.80. Average 1.6 sessions per guest with ICC=0.05. Constraints: repeat visitors across devices, 5% bot traffic, some cookie resets causing cross-arm contamination.

Answer:
1) Define the estimand (ITT vs TOT) and justify the unit (guest vs session) and exposure definition with cross-device deduping and noncompliance.
2) Compute required per-arm sample size accounting for clustering (show the design effect and final n per arm).
3) Specify SRM and integrity checks (e.g., device/geo imbalance, traffic-source mix), how to detect, and how to remediate.
4) If randomization fails and you only have pre/post windows (pre: 2025-07-01–2025-07-31; post: 2025-09-01–2025-09-30), formulate a credible causal strategy (e.g., DiD with covariates/CUPED or PSM/IPW): state the identifying assumptions, write the ATE estimator, and describe how you’d test parallel trends and overlap.
5) Address interference/novelty and propose a sequential monitoring plan that controls Type I error (e.g., O’Brien–Fleming boundaries) and a plan for early stopping for harm.
6) Suppose the experiment ends with control conv=5.0% (n=120,000) and treatment conv=5.6% (n=120,000). Compute the lift, its standard error/95% CI (properly accounting for two-proportion comparison), and interpret both statistical and practical significance; would you ship under these constraints?

This question evaluates skills in experimental design, causal inference, and applied statistics — including estimand selection, sample-size calculation under clustering, integrity monitoring, handling noncompliance and contamination, sequential monitoring, and two-proportion inference — within the Analytics & Experimentation domain for a Data Scientist role. It is commonly asked because interviewers need to assess the ability to design and analyze robust A/B tests under real-world constraints; the prompt requires both conceptual understanding of causal assumptions and practical application of power calculations, diagnostics, and monitoring procedures.

How do I approach Analytics & Experimentation interview questions?

Analytics & Experimentation questions require a solid grasp of core concepts plus regular practice. PracHub provides solutions with explanations to help you prepare for Analytics & Experimentation interviews.

What difficulty level is this interview question?

This is a hard Analytics & Experimentation question, commonly asked during Technical Screen rounds at Airbnb.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Airbnb during technical interviews.

Design an A/B test with causal inference | Airbnb Interview Question

A/B Test Design: Checkout Nudge (Guest-Level Randomization)

Setup

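Run dates: 2025-08-04 to 2025-08-31 (28 days). Analyze the primary metric on a matured panel: either restrict to guests first exposed by 2025-08-24, so each has a complete 7-day follow-up window within the run, or extend measurement through 2025-09-07 so that guests first exposed as late as 2025-08-31 also have a full 7 days.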
Randomization unit: guest_id (sticky across sessions during the test).
Primary metric: Completed order within 7 days of first exposure to the checkout nudge.
Guardrails: Bounce rate; p95 page latency.
Baseline 7-day conversion per guest: 5%.
Minimum detectable effect (MDE): 8% relative lift over baseline.
Test parameters: two-sided α = 0.05; power = 0.80.
Average sessions per guest: 1.6; intra-class correlation (ICC) across sessions within a guest: 0.05.
Constraints: repeat visitors across devices; ~5% bot traffic; some cookie resets causing cross-arm contamination.
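
As a rough check on what these parameters imply, the sketch below (not part of the original question) computes a per-arm sample size with the standard normal-approximation formula for comparing two proportions, then applies the usual design effect 1 + (m - 1) * ICC. The function names are illustrative. Treating the 1.6 sessions per guest as the cluster size and inflating the guest count by the design effect is an assumption on the conservative side: with a per-guest binary outcome the guest is already the analysis unit, so some solutions would skip the inflation.

```python
import math
from statistics import NormalDist


def per_arm_sample_size(p_base, rel_lift, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm for a two-sided
    two-proportion test (unpooled variances)."""
    p_treat = p_base * (1 + rel_lift)
    delta = p_treat - p_base
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for target power
    variance_sum = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / delta ** 2)


def design_effect(avg_cluster_size, icc):
    """Kish design effect for equal-sized clusters."""
    return 1 + (avg_cluster_size - 1) * icc


# Inputs taken from the setup above.
n_base = per_arm_sample_size(p_base=0.05, rel_lift=0.08)
deff = design_effect(avg_cluster_size=1.6, icc=0.05)
n_adjusted = math.ceil(n_base * deff)  # assumption: inflate guests by DEFF

print(f"unadjusted n per arm: {n_base}")
print(f"design effect:        {deff:.3f}")
print(f"adjusted n per arm:   {n_adjusted}")
```

Under these assumptions the design effect is 1.03, the unadjusted requirement is a little over 48,000 guests per arm, and the adjusted figure is just under 50,000. Filtering the ~5% bot traffic and diluting the effect through cross-arm contamination would both push the required enrollment higher.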

Tasks

1) Define the estimand (ITT vs TOT), justify guest vs session as the analysis unit, and define exposure with cross-device deduping and noncompliance handling.
2) Compute required per-arm sample size accounting for clustering. Show the design effect and the final required n per arm.
3) Specify SRM and integrity checks (e.g., device/geo imbalance, traffic-source mix), how to detect them, and how to remediate issues.
4) If randomization fails and you only have pre/post windows (pre: 2025-07-01–2025-07-31; post: 2025-09-01–2025-09-30), propose a credible causal strategy (e.g., DiD with covariates/CUPED or PSM/IPW). State the identifying assumptions, write the ATE estimator, and describe tests for parallel trends and overlap.
5) Address interference/novelty and propose a sequential monitoring plan that controls Type I error (e.g., O’Brien–Fleming boundaries) and a plan for early stopping for harm.
6) Suppose the experiment ends with control conv = 5.0% (n = 120,000) and treatment conv = 5.6% (n = 120,000). Compute the lift, its standard error and 95% CI for a two-proportion comparison, interpret statistical and practical significance, and state whether you would ship.
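
For the final task, a minimal sketch of the two-proportion calculation is shown below. It assumes independent guests and the normal approximation, uses the unpooled standard error for the confidence interval and the pooled standard error for the z-test, and uses illustrative variable names that are not part of the original question.

```python
import math
from statistics import NormalDist

# Observed results from the final task.
n_c, p_c = 120_000, 0.050  # control
n_t, p_t = 120_000, 0.056  # treatment

abs_lift = p_t - p_c        # absolute difference in conversion rate
rel_lift = p_t / p_c - 1    # relative lift over control

# Unpooled standard error, used for the confidence interval.
se_unpooled = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
z_crit = NormalDist().inv_cdf(0.975)
ci_low = abs_lift - z_crit * se_unpooled
ci_high = abs_lift + z_crit * se_unpooled

# Pooled standard error, used for the test of H0: p_t == p_c.
p_pool = (p_c * n_c + p_t * n_t) / (n_c + n_t)
se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
z_stat = abs_lift / se_pooled
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

print(f"absolute lift: {abs_lift:.4f}  relative lift: {rel_lift:.1%}")
print(f"SE (unpooled): {se_unpooled:.6f}")
print(f"95% CI:        ({ci_low:.4f}, {ci_high:.4f})")
print(f"z = {z_stat:.2f}, two-sided p = {p_value:.2e}")
```

With these inputs the absolute lift is 0.6 percentage points (12% relative), the standard error is about 0.09 percentage points, and the 95% CI runs from roughly 0.42 to 0.78 percentage points, so it excludes zero. The point estimate is above the 8% relative MDE, but the ship decision still depends on the guardrails, SRM and contamination checks, and any adjustment for sequential looks.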

