Experimentation, A/B Testing, and Product Metrics

What's being tested

Interviewers are probing whether you can design, evaluate, and defend model-driven experiments end-to-end as a Machine Learning Engineer: choose appropriate metrics, ensure statistical validity, detect instrumentation or modeling problems, and translate results into safe deployment decisions. They care that you can own the measurement pipeline (offline → online parity), reason about short- vs long-term effects, and communicate uncertainty and mitigation plans clearly to stakeholders.

Core knowledge

Randomization/unit of assignment: pick the correct randomization unit (user, session, device, cookie) to avoid interference and preserve independence; justify choice by product causal path and frequency of interaction.
Primary vs guardrail metrics: define one clear primary metric (e.g., CTR, CVR, revenue per DAU) and several guardrails (engagement, latency, error-rate, fairness) that must not regress; instrument both before rollout.
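Hypothesis, effect size, and power: compute the required sample size per arm with the two-sample proportions formula
$n = \frac{(Z_{1-\alpha/2}+Z_{1-\beta})^2\,(p_1(1-p_1)+p_2(1-p_2))}{(p_2-p_1)^2}$
where $p_1$ is the baseline rate, $p_2$ the rate under treatment, $\alpha$ the two-sided significance level, and $1-\beta$ the power; state the minimal detectable effect (MDE), $p_2-p_1$, in product-relevant terms.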
Type I/II and multiple comparisons: fix the Type I error rate (α) in advance through pre-registration, and bound Type II error (β) through the power calculation; when testing multiple arms, metrics, or segments, control the family-wise error rate with Bonferroni or the False Discovery Rate with Benjamini–Hochberg.
Sequential testing: if checking results repeatedly, use alpha-spending or group-sequential methods (e.g., O’Brien–Fleming) or platform-supported sequential tests; avoid naive peeking.
Pre-launch sanity checks: run A/A tests, check sample ratio mismatch (SRM) with chi-square, verify covariate balance, and validate logging keys and feature parity between offline training and online inference.
Offline ↔ online parity: ensure features used in training are available and identically computed at serving (feature-store parity); measure offline metrics (AUC, NDCG, calibration) and map them to online KPI expectations.
Heterogeneous treatment effects (HTE): analyze slices and consider uplift modeling when effects vary across cohorts; correct for data-snooping by pre-specifying key subgroups or using cross-validation for discovery.
Drift & monitoring: set alerts for input data drift (feature distribution shifts), label delay, prediction distribution shifts, and monitor model calibration, p99 latency, and error-rate in production.
Interference and network effects: anticipate spillover (a violation of SUTVA, the stable unit treatment value assumption) in social or shared-resource settings; randomize at the cluster level so that units that interact with each other land in the same arm.
Long-term treatment effects: plan for metrics over time (retention, lifetime value), use holdout groups or staggered rollouts, and estimate cumulative impact instead of only short-term lifts.
Diagnosis workflow for failures: triage in order: (1) instrumentation bug, (2) SRM or randomization bug, (3) data-labeling issues, (4) model/feature drift, (5) a genuinely negative treatment effect. Use canary and shadow traffic to isolate the cause.
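The sample-size formula under "Hypothesis, effect size, and power" can be evaluated directly. The following is a minimal sketch using only the Python standard library; the function name `sample_size_per_arm` and the default values of α = 0.05 and 80% power are illustrative assumptions, not values prescribed by this guide.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p1: float, p2: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size for a two-sided, two-sample test of proportions.

    p1 is the baseline rate and p2 the rate under treatment, so the
    MDE is p2 - p1. The defaults for alpha and power are assumptions.
    """
    if p1 == p2:
        raise ValueError("p1 and p2 must differ; the MDE cannot be zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # Z_{1 - alpha/2}
    z_beta = NormalDist().inv_cdf(power)           # Z_{1 - beta}
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2
    return math.ceil(n)  # round up: a fractional user cannot be sampled


if __name__ == "__main__":
    # Hypothetical scenario: baseline CVR of 10%, MDE of 1 percentage point.
    print(sample_size_per_arm(0.10, 0.11))
```

Because n scales with the inverse square of the MDE, halving the MDE roughly quadruples the required traffic, which is worth stating when negotiating experiment duration.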

Tip: Pre-register the experiment (metrics, MDE, duration) and store a reproducible analysis pipeline (e.g., a Jupyter or R Markdown notebook) for auditability.
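For the multiple-comparisons item in Core knowledge, the Benjamini–Hochberg step-up procedure is short enough to sketch by hand. The function name `benjamini_hochberg` and the 5% FDR default are assumptions for illustration; a production analysis would more likely rely on an established statistics library.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], fdr: float = 0.05) -> List[bool]:
    """Return one reject flag per hypothesis, in the input order.

    Step-up rule: find the largest rank k (1-based, p-values sorted
    ascending) with p_(k) <= (k / m) * fdr, then reject ranks 1..k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_k = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            reject[idx] = True
    return reject


if __name__ == "__main__":
    # Hypothetical p-values for four guardrail metrics.
    print(benjamini_hochberg([0.04, 0.03, 0.02, 0.01]))
```

With these four hypothetical p-values, Bonferroni at α = 0.05 would reject only the smallest (0.01 ≤ 0.0125), while the step-up rule rejects all four, which illustrates why FDR control is the less conservative choice when many metrics or segments are examined.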

Worked example — "Respond to long-term concerns after A/B success"

Frame: first clarify the observed result (which metrics improved, over what duration, sample size, and which guardrails were tracked). Ask if the effect is concentrated in new users or existing users, and whether any downstream metrics (retention, complaints, latency) showed trends.

Skeleton answer pillars:

Validate the short-run claim: check SRM, A/A tests, and instrumentation logs; re-run the statistical test with the pre-registered analysis, applying a sequential adjustment if peeking occurred.
Assess external validity: analyze cohorts (new vs returning users), geographical splits, and seasonality to test robustness.
Estimate long-term impact: run a staged rollout with holdout or randomized control for longer windows, or project retention/LTV using survival analysis.
Operationalize guardrails: deploy a canary with explicit rollback criteria, and set up monitoring dashboards for retention, support tickets, and fairness metrics.
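The SRM check in the first pillar is a chi-square goodness-of-fit test on the assignment counts. A minimal two-arm sketch follows; the function name `srm_check` and the p < 0.001 alert threshold are assumptions (a strict threshold is a common convention for SRM, but it should be set to match your platform's policy).

```python
import math
from typing import Tuple


def srm_check(n_control: int, n_treatment: int,
              expected_treatment_share: float = 0.5,
              threshold: float = 0.001) -> Tuple[float, float, bool]:
    """Chi-square goodness-of-fit test for sample ratio mismatch (two arms).

    Returns (chi2 statistic, p-value, flagged). The threshold is an
    assumed alerting convention, not a universal constant.
    """
    total = n_control + n_treatment
    exp_t = total * expected_treatment_share
    exp_c = total - exp_t
    chi2 = ((n_control - exp_c) ** 2 / exp_c
            + (n_treatment - exp_t) ** 2 / exp_t)
    # Two arms give 1 degree of freedom, where the chi-square tail
    # probability has the closed form erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, p_value < threshold


if __name__ == "__main__":
    # Hypothetical counts from a 50/50 experiment.
    print(srm_check(50_000, 51_500))
```

A flagged SRM means the lift estimate should not be trusted until the assignment or logging bug is found, regardless of how significant the primary metric looks.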

Tradeoff to flag: a quick full rollout maximizes short-term gain but risks long-term churn; a conservative phased rollout trades speed for safety. Close: "If I had more time, I'd set randomized long-holdout groups and run uplift models to identify cohorts where the change backfires."

A second angle — "Explain modeling challenges and fixes"

The same evaluation mindset applies when debugging a deployed model: start by validating instrumentation (SRM, logging, feature parity). Organize the response around (1) reproducing the offline results with recent data, (2) running shadow traffic to compare production inputs and predictions against offline expectations, (3) detecting drift and recalibrating (e.g., Platt scaling for probabilities), and (4) separating quick mitigations (threshold adjustments, fallback models) from longer-term fixes (retraining, feature fixes). Emphasize the tradeoff between an immediate rollback (costly but safe) and targeted mitigations (less disruptive and better suited to partial regressions, but carrying more residual risk). Communicate uncertainty and an evidence-backed timeline for retraining or a phased redeploy.
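As a rough illustration of the recalibration step, the sketch below fits a Platt-style sigmoid to a model's raw probabilities using plain gradient descent on log loss. It is a simplified variant under stated assumptions: it calibrates the logit of the model's probability rather than an SVM margin, it omits the target smoothing from Platt's original method, and the names `fit_platt` and `calibrate`, the learning rate, and the step count are all hypothetical choices.

```python
import math
from typing import List, Tuple


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def _logit(p: float, eps: float = 1e-6) -> float:
    p = min(max(p, eps), 1 - eps)  # clip so the logit stays finite
    return math.log(p / (1 - p))


def fit_platt(raw_probs: List[float], labels: List[int],
              lr: float = 0.1, steps: int = 2000) -> Tuple[float, float]:
    """Fit calibrated = sigmoid(a * logit(raw) + b) by minimizing log loss."""
    scores = [_logit(p) for p in raw_probs]
    a, b = 1.0, 0.0  # start from the identity mapping (no recalibration)
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = _sigmoid(a * s + b) - y  # d(log loss)/d(logit)
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(raw_prob: float, a: float, b: float) -> float:
    """Map a raw model probability to its recalibrated value."""
    return _sigmoid(a * _logit(raw_prob) + b)


if __name__ == "__main__":
    # Hypothetical overconfident model: predicts 0.9 where the observed
    # positive rate is 0.6, and 0.1 where it is 0.4.
    raw = [0.9] * 10 + [0.1] * 10
    y = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 6
    a, b = fit_platt(raw, y)
    print(calibrate(0.9, a, b), calibrate(0.1, a, b))
```

Fit the calibrator on held-out recent data rather than the training set; otherwise it inherits the same optimism it is meant to correct.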

Common pitfalls

Pitfall: Analytic tunnel vision — only reporting a single statistically significant metric.
Surface other metrics and guardrails; pre-define primary metric and publish the full set of monitored KPIs so stakeholders don’t cherry-pick.

Pitfall: Mistaking statistical significance for practical importance.
Always translate lifts into business or user-impact terms (absolute change, share of users affected, projected revenue) and report confidence intervals alongside the MDE.
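One way to report practical importance is to state the absolute lift with its confidence interval instead of a bare p-value. The sketch below uses a Wald (normal-approximation) interval for the difference of two proportions; the function name `lift_confidence_interval` and the example counts are hypothetical, and the approximation assumes reasonably large samples in both arms.

```python
import math
from statistics import NormalDist
from typing import Tuple


def lift_confidence_interval(conv_c: int, n_c: int, conv_t: int, n_t: int,
                             confidence: float = 0.95) -> Tuple[float, float, float]:
    """Absolute lift (treatment minus control) with a Wald interval.

    Returns (lift, lower bound, upper bound) as proportions.
    """
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    lift = p_t - p_c
    # Unpooled standard error of the difference in proportions.
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return lift, lift - z * se, lift + z * se


if __name__ == "__main__":
    # Hypothetical counts: 1,000/10,000 control vs 1,100/10,000 treatment.
    lift, lo, hi = lift_confidence_interval(1_000, 10_000, 1_100, 10_000)
    print(f"lift={lift:.4f}, 95% CI=({lo:.4f}, {hi:.4f})")
```

Comparing the lower bound of the interval against the pre-registered MDE tells stakeholders whether the result is merely detectable or large enough to matter.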

Pitfall: Overlooking instrumentation and offline/online mismatch.
Don’t assume training features equal serving features; missing features or serving latency can cause models to fail silently. Use shadow traffic and end-to-end canaries to catch these failures.

Connections

Interviewers may pivot to causal inference (instrumental variables, difference-in-differences), MLOps topics (canary, CI/CD for models, feature stores), or deeper fairness and bias evaluation (metric parity by subgroup), so be prepared to bridge experimentation reasoning to these adjacent areas.
