Experimentation and Metrics Design

What's being tested

Interviewers are probing your ability to design, instrument, and analyze experiments that evaluate ML changes safely and interpretably. They expect you to pick the correct randomization unit, define robust evaluation metrics (including guardrails), calculate power and variance-reduction needs, and anticipate production biases introduced by model-driven exposure or logging. At Netflix, this ensures model changes improve long-term engagement without regressions in quality, latency, or fairness.

Core knowledge

Randomization unit: choose between user-, session-, or impression-level randomization depending on treatment scope and interference; user-level avoids cross-treatment contamination for the same user but yields fewer independent units, so it needs more traffic or a longer run to reach the same power.
Average Treatment Effect (ATE): estimate with $\widehat{ATE}=\bar{Y}_{T=1}-\bar{Y}_{T=0}$ and standard error $SE=\sqrt{\frac{s_1^2}{n_1}+\frac{s_0^2}{n_0}}$ for difference of means.
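Sample size / power: for a two-sided test, the per-arm sample size is $n\approx\frac{(z_{1-\alpha/2}+z_{1-\beta})^2(\sigma_1^2+\sigma_0^2)}{\delta^2}$, where $\delta$ is the minimum detectable absolute effect and $1-\beta$ is the target power; estimate the variances from pre-experiment logs rather than optimistic guesses.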
Exposure definition & logging: log treatment assignment, deterministic bucket ID, exposure event timestamps, and full context features; ensure stable assignment across services via hashing or feature flags.
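Counterfactual evaluation and IPS: for logged-policy evaluation, use Inverse Propensity Scoring (IPS) weighting $\hat{V}(\pi)=\frac{1}{N}\sum_{i=1}^{N}\frac{\mathbb{1}\{\pi(x_i)=a_i\}\,r_i}{p(a_i|x_i)}$, where $\pi$ is the (deterministic) target policy and $p(a_i|x_i)$ is the logging policy's propensity for the logged action; watch for high variance when propensities are small, which usually calls for clipping the weights.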
Metric choice for ranking/recommenders: prefer exposure-aware metrics (NDCG@k, precision@k, normalized watch_time) and guardrails like start_rate, completion_rate, CPU/latency.
Variance reduction: apply CUPED or stratification using pre-period covariates to reduce required sample sizes; control for strong predictors to tighten CI.
Sequential testing / stopping rules: avoid naive peeking; prefer alpha-spending, O'Brien–Fleming boundaries, or pre-registered Bayesian decision rules to control Type I error.
Multiple comparisons: apply Bonferroni or Benjamini–Hochberg (FDR) when testing many metrics or variants; pre-specify primary metric.
Interference & SUTVA violations: in recommendations, treatment changes who sees what content; model-triggered exposure breaks SUTVA — consider cluster randomization or analysis methods that account for interference.
Online vs offline parity: validate offline metrics (AUC, NDCG) against online outcomes and instrument a canary rollout; drift in input distributions will break offline-to-online correlation.
Significance vs. business impact: always report effect size, confidence intervals, and expected absolute impact (e.g., seconds of watch-time per user), not just p-value.
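
The difference-of-means and power formulas in the list above can be sketched in a few lines of Python. This is a minimal illustration, not a production analysis tool: the function names are hypothetical, the confidence interval uses a normal approximation, and n is read as the sample size per arm.

```python
import math
from statistics import NormalDist, mean, variance


def ate_with_ci(treated, control, alpha=0.05):
    """Difference of means, its standard error, and a normal-approximation CI."""
    ate = mean(treated) - mean(control)
    # variance() is the sample variance s^2 (n - 1 in the denominator).
    se = math.sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ate, se, (ate - z * se, ate + z * se)


def sample_size_per_arm(sigma_t, sigma_c, delta, alpha=0.05, power=0.8):
    """Units needed in EACH arm to detect an absolute lift `delta` (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = (z_alpha + z_power) ** 2 * (sigma_t ** 2 + sigma_c ** 2) / delta ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Illustrative numbers only: unit standard deviation in both arms, lift of 0.1.
    print(sample_size_per_arm(sigma_t=1.0, sigma_c=1.0, delta=0.1))
```

Halving the detectable lift `delta` quadruples the required sample, which is why the baseline variance estimate and the minimum effect worth detecting are worth agreeing on before the test starts.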

Worked example — "Design an A/B test to evaluate a new ranking model"

First 30s framing: ask what the treatment replaces (the full ranking pipeline or only a re-ranker), define the randomization unit (user vs impression), and confirm the primary and guardrail metrics plus acceptable latency/compute constraints.

Skeleton of answer:
(1) Randomization and assignment: hash user_id into deterministic buckets so assignment persists across services.
(2) Instrumentation: log assignment, exposures, the ranked list, positions, and downstream engagement.
(3) Power and sample size: estimate baseline variance from historical watch_time or click logs, then compute the sample size per arm and the ramp schedule needed to reach it.
(4) Analysis plan: pre-specify one primary online metric (e.g., start_rate or watch_time), keep offline NDCG@10 as a pre-launch check, use CUPED for variance reduction, and plan subgroup / heterogeneity checks.
(5) Rollout & guardrails: define abort triggers for latency or quality drops.

Key tradeoff: user-level randomization reduces interference but increases the sample size and time needed to detect small lifts; session- or impression-level randomization gives faster signals but risks contamination.

Close by saying: with more time, you would instrument counterfactual logging to run IPS and offline simulations, and pre-register sequential boundaries for adaptive ramping.
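
The CUPED step in the analysis plan can be illustrated with a small sketch. It assumes one pre-period covariate per user (for example, the same metric measured before the experiment) and estimates the coefficient on both arms pooled together; the function name `cuped_adjust` is hypothetical.

```python
from statistics import mean, variance


def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes y_i - theta * (x_i - mean(x)).

    y: in-experiment metric per user.
    x: the same user's pre-period covariate, in the same order as y.
    theta = cov(x, y) / var(x), estimated on the pooled data.
    """
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    # Subtracting a mean-zero term leaves the mean of y unchanged
    # while removing the variance explained by x.
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]
```

In use, the adjustment would be applied to all users at once, after which the adjusted outcomes are split back by arm and compared with the usual difference of means. The variance shrinks by roughly the squared correlation between the covariate and the outcome, so a weakly correlated covariate buys little.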

A second angle — "Evaluating continuous model updates (daily retrain) in production"

The same experimental principles apply, but the constraints differ: treatments are transient and overlapping, which causes carryover and non-independence. Use a staggered rollout with persistent bucket assignment so each user sees either a stable control stream or the treatment-updates stream; measure both short-term lifts and longer-term retention to detect novelty effects. Account for time-varying confounding by including calendar/time covariates and pre-period baselines; use rolling-window power calculations because variance will change as the model improves. Consider canary cohorts and automated rollback if drift in input distributions or latency regressions exceed thresholds. Here the tradeoff is velocity (rapid model improvement) versus experimental integrity (stable assignment).

Common pitfalls

Pitfall: Randomizing at the wrong unit — Many candidates default to impression-level randomization; this hides user-level effects and breaches independence, leading to diluted or misleading estimates. Always justify unit choice and its consequences on power and interference.

Pitfall: Relying only on p-value — Reporting statistical significance without effect size, confidence intervals, or absolute impact (e.g., hours of watch-time) leads to poor product decisions; quantify end-user impact.

Pitfall: Ignoring instrumentation fidelity — Failing to log deterministic bucket ID, treatment version, or exposure times creates unrecoverable analysis errors; require logging and a reproducible analysis DAG before rollout.
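
As a rough sketch of what the logging pitfall above asks for, the snippet below derives a deterministic bucket from a hash and emits one exposure record. Everything here is an assumption made for illustration: the function names, the record fields, the choice of SHA-256, and salting the hash with the experiment name so that buckets are independent across experiments.

```python
import hashlib
import json
import time


def assign_bucket(user_id: str, experiment: str, n_buckets: int = 1000) -> int:
    """Deterministic bucket: the same user and experiment always map to the same value."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def assign_arm(user_id: str, experiment: str,
               treatment_share: float = 0.5, n_buckets: int = 1000) -> str:
    """Map the bucket to an arm; any service can recompute this without shared state."""
    bucket = assign_bucket(user_id, experiment, n_buckets)
    return "treatment" if bucket < treatment_share * n_buckets else "control"


def exposure_record(user_id: str, experiment: str, model_version: str) -> str:
    """Serialize one exposure event (hypothetical schema) as a JSON line."""
    return json.dumps({
        "user_id": user_id,
        "experiment": experiment,
        "bucket": assign_bucket(user_id, experiment),
        "arm": assign_arm(user_id, experiment),
        "model_version": model_version,
        "exposure_ts": time.time(),  # when the user actually saw the ranked list
    })
```

Logging the bucket and the model version alongside the arm makes it possible to re-derive assignment during analysis and to detect sample-ratio mismatches or version skew after the fact.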

Connections

Interviewers may pivot to offline policy evaluation / counterfactual estimation, model monitoring & drift detection, or causal inference for heterogeneous treatment effects. Be prepared to discuss how experiment findings feed model retraining pipelines and monitoring systems.
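
If the conversation does pivot to offline policy evaluation, a clipped IPS estimator is a reasonable thing to be able to write down. The sketch below assumes a deterministic target policy and logs stored as (context, action, reward, propensity) tuples; the function name, the log layout, and the default clip value are illustrative choices, not a standard API.

```python
def clipped_ips(logs, target_policy, clip=10.0):
    """Estimate the value of `target_policy` from logged bandit feedback.

    logs: iterable of (context, action, reward, propensity), where propensity
          is the logging policy's probability of the logged action.
    target_policy: function mapping a context to the action it would take.
    clip: upper bound on the importance weight; trades bias for lower variance.
    """
    total = 0.0
    n = 0
    for context, action, reward, propensity in logs:
        n += 1
        # Only events where the target policy agrees with the logged action contribute.
        if target_policy(context) == action:
            total += min(1.0 / propensity, clip) * reward
    if n == 0:
        raise ValueError("no logged events")
    return total / n
```

Clipping biases the estimate when large weights are truncated, so it is worth reporting the estimate at more than one clip value to show how sensitive the conclusion is.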