This question evaluates skills in experimental design, causal inference, and applied statistics — including estimand selection, sample-size calculation under clustering, integrity monitoring, handling noncompliance and contamination, sequential monitoring, and two-proportion inference — within the Analytics & Experimentation domain for a Data Scientist role. It is commonly asked because interviewers need to assess the ability to design and analyze robust A/B tests under real-world constraints; the prompt requires both conceptual understanding of causal assumptions and practical application of power calculations, diagnostics, and monitoring procedures.
You own experimentation for an e-commerce checkout nudge. Design an A/B test randomized at the guest_id level and run for 28 days (2025-08-04 to 2025-08-31). Primary metric: completed order within 7 days of first exposure; guardrails: bounce rate and p95 page latency. Baseline 7-day per-guest conversion is 5%; minimum detectable relative lift is 8%; two-sided α=0.05; power=0.80. Average 1.6 sessions per guest with ICC=0.05. Constraints: repeat visitors across devices, 5% bot traffic, some cookie resets causing cross-arm contamination. Answer: 1) Define the estimand (ITT vs TOT) and justify the unit (guest vs session) and exposure definition with cross-device deduping and noncompliance. 2) Compute required per-arm sample size accounting for clustering (show the design effect and final n per arm). 3) Specify SRM and integrity checks (e.g., device/geo imbalance, traffic-source mix), how to detect, and how to remediate. 4) If randomization fails and you only have pre/post windows (pre: 2025-07-01–2025-07-31; post: 2025-09-01–2025-09-30), formulate a credible causal strategy (e.g., DiD with covariates/CUPED or PSM/IPW): state the identifying assumptions, write the ATE estimator, and describe how you’d test parallel trends and overlap. 5) Address interference/novelty and propose a sequential monitoring plan that controls Type I error (e.g., O’Brien–Fleming boundaries) and a plan for early stopping for harm. 6) Suppose the experiment ends with control conv=5.0% (n=120,000) and treatment conv=5.6% (n=120,000). Compute the lift, its standard error/95% CI (properly accounting for two-proportion comparison), and interpret both statistical and practical significance; would you ship under these constraints?
A/B Test Design: Checkout Nudge (Guest-Level Randomization)
You own experimentation for an e-commerce checkout flow. You're launching a checkout nudge and need to design, run, and read out an A/B test — including what to do if randomization breaks down.
This is an end-to-end experimentation case: you'll define the estimand, size the test under realistic traffic conditions, defend its integrity, fall back to an observational causal design if randomization fails, monitor it without inflating error rates, and make a final ship/no-ship call from the readout numbers.
Setup
Timeline
Run window: 2025-08-04 to 2025-08-31 (28 days).
Maturation: because the primary metric needs a 7-day conversion window, analyze on a matured panel. Either restrict to first exposures through 2025-08-24 (so every guest has a full 7 days of follow-up inside the run window), or keep all first exposures and allow a 7-day measurement lag out to 2025-09-07.
Experiment design
Randomization unit: guest_id, sticky across sessions for the duration of the test.
Primary metric: completed an order within 7 days of first exposure to the checkout nudge.
Guardrails: bounce rate; p95 page latency.
Statistical parameters
Baseline: 7-day per-guest conversion = 5%.
Minimum detectable effect (MDE): 8% relative lift over baseline.
Significance: two-sided α = 0.05.
Power: 0.80.
Clustering inputs: average 1.6 sessions per guest; intra-class correlation (ICC) across sessions within a guest = 0.05.
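The parameters above are enough to sketch the power arithmetic. The block below is a minimal illustration, not a definitive method: it assumes the standard normal-approximation formula for comparing two proportions and the usual design effect 1 + (m - 1) × ICC. The function names are hypothetical, and treating the inflated figure as a conservative per-arm guest count is an assumption, since a single per-guest binary outcome would not strictly need the session-level adjustment.

```python
from math import ceil
from statistics import NormalDist


def per_arm_sample_size(p_base, rel_lift, alpha=0.05, power=0.80):
    """Unclustered n per arm for a two-sided two-proportion z-test
    (normal approximation, unpooled variances)."""
    p_treat = p_base * (1 + rel_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    var_sum = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    return (z_alpha + z_beta) ** 2 * var_sum / (p_treat - p_base) ** 2


def design_effect(avg_cluster_size, icc):
    """Variance inflation from correlated observations within a cluster."""
    return 1 + (avg_cluster_size - 1) * icc


# Inputs taken from the setup: 5% baseline, 8% relative MDE,
# 1.6 sessions per guest, ICC = 0.05.
n_srs = per_arm_sample_size(0.05, 0.08)
deff = design_effect(1.6, 0.05)
n_clustered = ceil(n_srs * deff)

print(f"unclustered n per arm: {ceil(n_srs):,}")
print(f"design effect:         {deff:.3f}")
print(f"clustered n per arm:   {n_clustered:,}")
```

Any buffer for bot traffic or contamination would be applied on top of the clustered figure; how large that buffer should be is a judgment call the setup does not pin down.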
Traffic realities (constraints)
Repeat visitors across multiple devices.
~5% bot traffic.
Some cookie resets, causing cross-arm contamination.
Clarifying Questions to Ask
Before designing, confirm scope with the interviewer. Strong candidates surface assumptions rather than guessing:
Allocation & ramp: Is this a clean 50/50 split, or do we ramp from a small treatment fraction first? Is there an existing holdback we must respect?
Eligibility & exposure: Does "exposure" mean assigned to the nudge, or rendered the nudge? Which surfaces/pages count as checkout-eligible, and are logged-out guests in scope?
Identity resolution: What deterministic keys do we have (logged-in user_id, hashed email/payment token) to link a guest across devices and cookie resets, and how reliable are they?
Decision criteria: What lift is worth shipping, and what guardrail movement is disqualifying (e.g., max tolerable bounce increase or p95 latency budget)?
Operational constraints: Are there concurrent experiments or campaigns that could change traffic mix or interact with this nudge? Is there seasonality in the run window?
Data trust: How are bots currently filtered, and at what stage (ingestion vs analysis)?
What a Strong Answer Covers
The interviewer is checking for these signals across the six parts (these are dimensions, not the answers):
Estimand discipline: picks a primary estimand and justifies it against the business decision; treats noncompliance/leakage explicitly rather than ignoring it.
Correct unit of analysis: reasons about why the randomization unit and the analysis unit must align, and what clustering does to standard errors.
Power arithmetic that accounts for reality: a defensible base sample size, an explicit design-effect adjustment, and buffers for contamination/bots, rather than a single textbook number.
Integrity-first instinct: checks the split before trusting any effect; distinguishes a randomization defect from a real treatment effect.
Credible causal fallback: names identifying assumptions, writes an estimator, and proposes falsification tests, rather than just saying "use DiD."
Inference hygiene under peeking: controls Type I error across interim looks and separates stop-for-benefit from stop-for-harm.
Numerate, decisive readout: computes effect, SE, CI, and a test statistic correctly, then ties the ship decision to both statistical and practical significance and the guardrails.
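As one illustration of the integrity-first signal, a sample ratio mismatch (SRM) check can be sketched as a chi-square goodness-of-fit test on the assigned counts. This is a minimal sketch under stated assumptions: the function name, the 0.001 alert threshold, and the example counts are all hypothetical and do not come from the case. With one degree of freedom the chi-square tail probability equals erfc(sqrt(x / 2)), so only the standard library is needed.

```python
from math import erfc, sqrt


def srm_check(n_control, n_treatment, expected_treat_share=0.5, threshold=0.001):
    """Chi-square goodness-of-fit test (df = 1) of observed arm counts
    against the planned allocation. Returns (chi2, p_value, flagged)."""
    total = n_control + n_treatment
    exp_treat = total * expected_treat_share
    exp_control = total - exp_treat
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treat) ** 2 / exp_treat)
    # Survival function of chi-square with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value, p_value < threshold


# Hypothetical counts, for illustration only.
mild = srm_check(60_000, 60_900)
severe = srm_check(60_000, 61_500)
print(f"mild imbalance:   chi2={mild[0]:.2f}, p={mild[1]:.4f}, flagged={mild[2]}")
print(f"severe imbalance: chi2={severe[0]:.2f}, p={severe[1]:.2e}, flagged={severe[2]}")
```

The same test can be repeated within device, geo, and traffic-source slices to localize a defect, keeping in mind that many slices mean many tests.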
Tasks
1) Estimand & unit. Define the estimand (ITT vs TOT), justify guest vs session as the analysis unit, and define exposure, including cross-device deduping and how you handle noncompliance.
2) Sample size. Compute the required per-arm sample size accounting for clustering. Show the design effect and the final required n per arm.
3) Integrity checks. Specify SRM and integrity checks (e.g., device/geo imbalance, traffic-source mix): how you'd detect them and how you'd remediate issues.
4) Causal fallback. Suppose randomization fails and you only have pre/post windows (pre: 2025-07-01 to 2025-07-31; post: 2025-09-01 to 2025-09-30). Propose a credible causal strategy (e.g., DiD with covariates/CUPED, or PSM/IPW). State the identifying assumptions, write the ATE estimator, and describe tests for parallel trends and overlap.
5) Interference, novelty & monitoring. Address interference and novelty effects, and propose a sequential monitoring plan that controls Type I error (e.g., O'Brien–Fleming boundaries), plus a plan for early stopping for harm.
6) Readout & decision. Suppose the experiment ends with control conversion = 5.0% (n = 120,000) and treatment conversion = 5.6% (n = 120,000). Compute the lift, its standard error and 95% CI for a two-proportion comparison; interpret statistical and practical significance; and state whether you would ship.
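For the sequential monitoring task, one way to make an O'Brien–Fleming-style plan concrete is the Lan–DeMets alpha-spending function of O'Brien–Fleming type, α(t) = 2(1 − Φ(z_{α/2} / √t)), where t is the fraction of planned information observed. The sketch below only tabulates cumulative and incremental alpha. It assumes four equally spaced weekly looks, which the case does not specify, and it does not compute the exact z-boundaries, which require numerical integration over the joint distribution of the interim statistics (typically done with a group-sequential library).

```python
from math import sqrt
from statistics import NormalDist


def obf_alpha_spent(info_fraction, alpha=0.05):
    """Cumulative two-sided alpha spent at a given information fraction
    under the Lan-DeMets O'Brien-Fleming-type spending function."""
    if not 0 < info_fraction <= 1:
        raise ValueError("info_fraction must be in (0, 1]")
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / sqrt(info_fraction)))


# Assumed schedule: one look per week over the 28-day run.
looks = [0.25, 0.50, 0.75, 1.00]
cumulative = [obf_alpha_spent(t) for t in looks]
incremental = [cumulative[0]] + [
    later - earlier for earlier, later in zip(cumulative, cumulative[1:])
]

for t, cum, inc in zip(looks, cumulative, incremental):
    print(f"t={t:.2f}  cumulative alpha={cum:.5f}  spent at this look={inc:.5f}")
```

The schedule spends almost nothing at the first look and most of the budget at the final one, which is why early stopping for benefit needs an extreme result. A stop-for-harm rule on the guardrails is usually set separately and less conservatively; that choice is a design decision, not something this function encodes.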
Follow-up Questions
Be ready to go deeper after the main answer:
Heterogeneity: Treatment is positive overall but you suspect harm on low-end devices where the nudge adds latency. How would you detect a harmful subgroup without p-hacking across many slices?
Scale & traffic shortfall: If forecasted traffic only delivers ~60% of the required per-arm n in the 28-day window, what are your options, and how does each affect power, MDE, or run time?
Contamination worsens: If cookie resets push cross-arm contamination to 5%, what does that do to your observed ITT and your required sample size, and would you still trust the readout?
TOT divergence: If the nudge renders for only 70% of assigned-treatment guests, how do ITT and TOT diverge, and which one drives the ship decision versus the per-user efficacy story?
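The last two follow-ups reduce to simple scaling arguments, sketched below under explicit assumptions. The TOT scaling assumes one-sided noncompliance and that assigned-but-not-rendered guests are unaffected by assignment (an exclusion restriction). The dilution model assumes a symmetric fraction c of each arm effectively receives the other arm's experience. The function names are hypothetical, and the 0.4 percentage-point ITT used in the example is simply the design MDE (8% of 5%), not an observed result.

```python
def tot_from_itt(itt, compliance_rate):
    """Wald-style scaling: TOT = ITT / share of assigned-treatment guests
    who actually received the treatment (one-sided noncompliance)."""
    if not 0 < compliance_rate <= 1:
        raise ValueError("compliance_rate must be in (0, 1]")
    return itt / compliance_rate


def diluted_itt(true_itt, contamination_rate):
    """Observed ITT when a fraction c of each arm is swapped into the
    other arm's experience: the effect shrinks by a factor (1 - 2c)."""
    return true_itt * (1 - 2 * contamination_rate)


def sample_size_inflation(contamination_rate):
    """Required n scales with 1 / effect^2, so dilution by (1 - 2c)
    inflates the required sample size by 1 / (1 - 2c)^2."""
    return 1 / (1 - 2 * contamination_rate) ** 2


itt_example = 0.004  # design MDE in absolute terms: 8% of a 5% baseline
print(f"TOT at 70% render rate:        {tot_from_itt(itt_example, 0.70):.5f}")
print(f"observed ITT at 5% contamination: {diluted_itt(itt_example, 0.05):.5f}")
print(f"sample-size inflation factor:     {sample_size_inflation(0.05):.3f}")
```

Under these assumptions ITT remains the quantity that describes what shipping would deliver, while TOT describes the effect on guests who actually see the nudge.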
Constraints & Assumptions
Anchor your reasoning to these (don't invent additional numbers):
Equal allocation (50/50) unless you explicitly argue for a ramp; assignment is sticky per guest_id.
"First exposure" anchors the 7-day conversion window; only checkout-eligible guests enter the analysis population.
Bots (~5%) and cookie resets are known data-quality risks; state where you'd filter or buffer for them rather than assuming clean data.
For the causal fallback, assume a plausibly-comparable never-exposed control group exists (e.g., a retained holdback or unaffected surface/geo) and that the metric definition is stable across pre/post windows.
Use the readout numbers exactly as given (p_c = 5.0%, p_t = 5.6%, n = 120,000 per arm).
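With the readout numbers fixed, the two-proportion arithmetic can be sketched as follows. This is a minimal illustration using the normal approximation: the confidence interval uses the unpooled standard error, and the z statistic uses the pooled standard error under the null of no difference. The function name and the returned keys are hypothetical, and the calculation treats guests as independent, so it does not adjust for clustering or contamination.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_readout(p_c, p_t, n_c, n_t, alpha=0.05):
    """Absolute and relative lift, unpooled SE and CI for the difference,
    and a pooled two-sided z-test."""
    diff = p_t - p_c
    se_unpooled = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)

    p_pool = (p_c * n_c + p_t * n_t) / (n_c + n_t)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z_stat = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

    return {
        "abs_lift": diff,
        "rel_lift": diff / p_c,
        "se": se_unpooled,
        "ci": ci,
        "z": z_stat,
        "p_value": p_value,
    }


result = two_proportion_readout(p_c=0.050, p_t=0.056, n_c=120_000, n_t=120_000)
print(f"absolute lift: {result['abs_lift']:.4f}  relative lift: {result['rel_lift']:.1%}")
print(f"SE: {result['se']:.6f}  95% CI: ({result['ci'][0]:.5f}, {result['ci'][1]:.5f})")
print(f"z: {result['z']:.2f}  p-value: {result['p_value']:.2e}")
```

Comparing the lower end of the interval with the absolute difference implied by the 8% relative MDE is one way to frame practical significance. The ship decision still depends on the guardrails and integrity checks, which these numbers alone do not cover.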