Experiment Design For Product ML

What's being tested

The interviewer is probing your ability to design credible, operational experiments that evaluate ML-driven product changes for new-user onboarding. They want to see signal definition (what "activation" means), experimental randomization and power reasoning, telemetry and model-serving considerations, and cross-functional launch/monitoring tradeoffs — all from the Machine Learning Engineer perspective (model pipelines, serving, telemetry parity, and safe rollout), not product strategy or data-platform plumbing.

Core knowledge

Primary metric (activation): define a single, business-aligned primary metric (e.g., `activated_new_user` boolean within 7 days) and complementary metrics (engagement, retention, negative-signal rates). Pre-specify direction and metric computation logic.
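Randomization unit & interference: randomize at the user level by default; fall back to `device` or `session` only where no stable user identity exists (e.g., logged-out onboarding). When users influence each other (e.g., invites) or share resources, account for spillover, for instance by randomizing at the cluster level.
Sample size & MDE: compute the per-arm sample size from the baseline rate $p_1$, the desired power $1-\beta$, the significance level $\alpha$, and the minimum detectable effect (MDE, as an absolute lift), so that the treatment rate is $p_2 = p_1 + \text{MDE}$. For proportions, use the approximation $n \approx \frac{(z_{1-\alpha/2}\sqrt{2\bar{p}(1-\bar{p})} + z_{1-\beta}\sqrt{p_1(1-p_1)+p_2(1-p_2)})^2}{(p_1-p_2)^2}$ with pooled rate $\bar{p} = (p_1+p_2)/2$, then inflate $n$ for expected attrition.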
Sequential testing & stopping rules: avoid naive peeking; use alpha-spending or pre-registered sequential procedures (e.g., Pocock, O’Brien–Fleming) or Bayesian approaches. Document stopping rule upfront.
Stratification & blocking: stratify randomization by high-variance covariates (country, platform `mobile`/`web`, acquisition source) to improve power and reduce imbalance; maintain deterministic hashing for consistent assignment.
Instrumentation parity (online vs offline): ensure the serving model and offline simulation use identical features, versions, and pre-processing; log inputs, predictions, and decisions with consistent keys (`user_id`, `experiment_id`, `treatment_arm`, `timestamp`).
Guardrail & safety metrics: pre-specify guardrail metrics (e.g., `message_abuse_rate`, `p99_latency`, session-drop rate), the thresholds that trigger rollback, and automated alerts that fire on absolute deltas rather than waiting for a significant p-value.
Heterogeneous treatment effects (HTE): plan HTE analyses (by platform, region, user cohort) but correct for multiple comparisons (Bonferroni / Benjamini-Hochberg) and treat HTE exploration as hypothesis-generating unless pre-specified.
Model-related rollout: prefer staged rollouts (canary → ramp) with deterministic bucketing and the ability to roll back to a previous model version; measure online/offline metric parity and monitor feature drift and data skew post-deploy.
Analysis plan & pre-registration: pre-register primary metric, statistical test (e.g., two-sample proportion z-test or permutation test), covariate adjustments (logistic regression), and subgroup analyses to avoid p-hacking.
Bias from missingness and attrition: track differential drop-out; if attrition correlates with treatment, use intention-to-treat (ITT) and consider instrumental-variable or sensitivity analyses.
Latency and user experience tradeoffs: model changes may increase inference latency; quantify how additional `p99` latency affects activation and balance latency cost vs metric gain.
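
The sample-size formula and the two-sample proportion z-test named above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a definitive implementation: the function names are hypothetical, the MDE is treated as an absolute lift, $p$ in the first square root is read as the pooled rate $(p_1+p_2)/2$, and attrition is handled by simple inflation.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: z-quantiles and tail areas


def sample_size_per_arm(p1, mde, alpha=0.05, power=0.80, attrition=0.0):
    """Approximate users needed in EACH arm to detect an absolute lift `mde`.

    Assumes a two-sided test and two equally sized arms (hypothetical helper).
    """
    if not 0.0 <= attrition < 1.0:
        raise ValueError("attrition must be in [0, 1)")
    p2 = p1 + mde
    p_bar = (p1 + p2) / 2.0  # pooled rate under the null
    z_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    z_beta = _STD_NORMAL.inv_cdf(power)
    numerator = (
        z_alpha * math.sqrt(2.0 * p_bar * (1.0 - p_bar))
        + z_beta * math.sqrt(p1 * (1.0 - p1) + p2 * (1.0 - p2))
    ) ** 2
    n = numerator / (p1 - p2) ** 2
    # Inflate so that the expected number of retained users still reaches n.
    return math.ceil(n / (1.0 - attrition))


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided pooled z-test for rate_b - rate_a; returns (z, p_value).

    Does not guard against a pooled rate of exactly 0 or 1.
    """
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (rate_b - rate_a) / se
    p_value = 2.0 * (1.0 - _STD_NORMAL.cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    # Illustrative numbers only: 20% baseline activation, +2 points absolute MDE.
    print(sample_size_per_arm(0.20, 0.02))
    print(sample_size_per_arm(0.20, 0.01))  # halving the MDE needs far more users
```

With these illustrative inputs the requirement is on the order of 6,500 users per arm, and halving the MDE roughly quadruples it, which is why the MDE is worth negotiating before the experiment starts.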

Worked example — Improve Reddit onboarding

Frame: ask clarifying questions in the first 30 seconds: "How do we define activation? Which platforms and user acquisition sources are in-scope? Are there any safety or latency constraints? What's the minimum worthwhile effect (MDE) and acceptable risk?" Skeleton of a strong answer: (1) State hypothesis (e.g., personalized onboarding increases 7-day activation), (2) Design experiment (user-level randomization, stratification by platform, canary rollout), (3) Metric & power (compute sample size for target MDE on activation rate), (4) Instrumentation & serving (log features, model version; ensure online/offline parity), (5) Monitoring & rollback (guardrails for abuse/latency).
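
Step (2) of the skeleton depends on deterministic assignment that stays stable as the rollout ramps. The sketch below is one possible way to do it, assuming string `user_id` and `experiment_id` keys; the `assign` function, the salt strings, and the 10,000-bucket granularity are illustrative choices, not a description of any particular platform's bucketing service.

```python
import hashlib

BUCKETS = 10_000  # granularity of the ramp: 1 bucket = 0.01% of users


def _hash_int(key: str) -> int:
    """Stable integer derived from SHA-256 (unlike Python's salted hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest[:15], 16)  # first 60 bits are plenty for bucketing


def assign(user_id, experiment_id, ramp_fraction, arms=("control", "treatment")):
    """Return the arm for this user, or None if the user is not yet exposed.

    Exposure and arm use different salts, so raising `ramp_fraction` only
    adds users: anyone already in the experiment keeps the same arm.
    """
    exposure_bucket = _hash_int(f"{experiment_id}:exposure:{user_id}") % BUCKETS
    if exposure_bucket >= round(ramp_fraction * BUCKETS):
        return None
    arm_index = _hash_int(f"{experiment_id}:arm:{user_id}") % len(arms)
    return arms[arm_index]


if __name__ == "__main__":
    # Hypothetical identifiers, canary at 1% and then a 50% ramp.
    for fraction in (0.01, 0.50):
        exposed = sum(
            assign(f"user-{i}", "onboarding-v2", fraction) is not None
            for i in range(10_000)
        )
        print(fraction, exposed)
```

Salting by `experiment_id` keeps assignments independent across experiments, and logging the returned arm alongside `user_id`, `experiment_id`, and `timestamp` lets the analysis recompute and audit assignment later.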

Flag an explicit tradeoff: if activation baseline is low, detecting small percentage improvements requires large samples or longer duration — you might prioritize larger MDE or run a targeted segment test. Close: "If I had more time I'd run an offline replay on historical new-user logs to validate the personalized model's propensity scores and simulate expected lift, and pre-register HTE analyses for newcomers vs app-installs."

A second angle — personalization vs friction reduction

Another way to frame onboarding experiments is to compare a personalization-based model (ML-driven feed suggestions) against a friction-reduction change (fewer required steps). The same experiment-design concepts apply, but the constraints shift: personalization demands richer feature pipelines, careful online/offline parity checks, and privacy considerations, while friction changes are UI/flow experiments where latency rarely matters but measurement (tracking intermediate funnel steps) is critical. With personalization you may need to detect smaller, statistically noisier per-cohort effects and run stronger HTE analyses; with friction reduction you can often get a larger immediate lift with easier instrumentation. The ML Engineer should emphasize model-serving implications for personalization and simpler telemetry for flow changes.

Common pitfalls

Pitfall: Defining activation vaguely. Saying "more engaged" without a precise, time-bounded metric (e.g., `activated` within 7 days) makes hypothesis testing impossible and invites post-hoc metric fishing.
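
To make "precise and time-bounded" concrete, the metric can be written as a small function that is fixed before the experiment starts. This is only a sketch: the qualifying event names, the `min_events` threshold, and the `is_activated` helper are assumptions for illustration, since the actual definition of activation is a product decision.

```python
from datetime import datetime, timedelta, timezone

ACTIVATION_WINDOW = timedelta(days=7)
# Hypothetical set of actions that count toward activation.
QUALIFYING_EVENTS = frozenset({"subscribe", "vote", "comment", "post"})


def is_activated(signup_at, events, window=ACTIVATION_WINDOW, min_events=1):
    """True if the user logged >= min_events qualifying events in the window.

    `events` is an iterable of (event_name, timestamp) pairs. The window is
    half-open: [signup_at, signup_at + window).
    """
    deadline = signup_at + window
    qualifying = sum(
        1
        for name, ts in events
        if name in QUALIFYING_EVENTS and signup_at <= ts < deadline
    )
    return qualifying >= min_events


if __name__ == "__main__":
    signup = datetime(2024, 1, 1, tzinfo=timezone.utc)
    events = [
        ("page_view", signup + timedelta(hours=1)),  # not a qualifying event
        ("vote", signup + timedelta(days=8)),        # qualifying but too late
    ]
    print(is_activated(signup, events))
```

Because the window is measured from each user's own signup time, the analysis has to wait until every enrolled user has had the full seven days before the metric is final.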

Pitfall: Ignoring online/offline parity. Evaluating a model offline with feature generation that does not match serving leads to optimistic results; always validate with a feature replay and shadow serving before experiment rollout.
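
A feature-replay check can be as simple as joining logged serving features against offline recomputed features on a shared key and listing the disagreements. The sketch below assumes both sides are available as dictionaries keyed by `(user_id, timestamp)`; the `feature_parity_report` function, its tolerances, and the feature names in the example are hypothetical.

```python
import math
from numbers import Real


def _values_match(a, b, rel_tol, abs_tol):
    """Numeric values compare within tolerance; everything else must be equal."""
    numeric = (
        isinstance(a, Real) and isinstance(b, Real)
        and not isinstance(a, bool) and not isinstance(b, bool)
    )
    if numeric:
        return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
    return a == b


def feature_parity_report(online_rows, offline_rows, rel_tol=1e-6, abs_tol=1e-9):
    """Compare features logged at serving time with offline recomputation.

    Both arguments map (user_id, timestamp) -> {feature_name: value}.
    A feature missing on one side is reported as a mismatch against None.
    """
    missing_offline = [key for key in online_rows if key not in offline_rows]
    mismatches = []
    for key, online in online_rows.items():
        offline = offline_rows.get(key)
        if offline is None:
            continue
        for name in sorted(set(online) | set(offline)):
            a, b = online.get(name), offline.get(name)
            if not _values_match(a, b, rel_tol, abs_tol):
                mismatches.append((key, name, a, b))
    return {"missing_offline": missing_offline, "mismatches": mismatches}


if __name__ == "__main__":
    online = {("u1", 1700000000): {"account_age_days": 0, "platform": "mobile"}}
    offline = {("u1", 1700000000): {"account_age_days": 1, "platform": "mobile"}}
    print(feature_parity_report(online, offline))
```

In practice the interesting output is the per-feature mismatch rate: a single feature that disagrees on most rows usually points to a preprocessing or timing difference between the two pipelines.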

Pitfall: Over-interpreting subgroup findings. Running many post-hoc splits and reporting the largest positive lift without correction will look impressive but is statistically misleading; call exploratory HTE what it is.
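
One way to keep subgroup reporting honest is to run the per-segment p-values through a false-discovery-rate correction before calling anything a finding. Below is a minimal sketch of the Benjamini-Hochberg step-up procedure; the function name is hypothetical, the p-values are made-up illustrative numbers rather than results, and the procedure's usual assumption of independent or positively dependent tests applies.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Step-up rule: find the largest rank k (1-based, ascending p-values) with
    p_(k) <= (k / m) * q, then reject every hypothesis ranked k or lower.
    """
    m = len(p_values)
    if m == 0:
        return []
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            reject[idx] = True
    return reject


if __name__ == "__main__":
    # Illustrative p-values for eight hypothetical subgroup comparisons.
    subgroup_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
    print(benjamini_hochberg(subgroup_p, q=0.05))
```

In this illustrative list five p-values fall below 0.05, but only the two smallest survive the correction, which is the gap between "largest observed lift" and a claim that would hold up.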

Connections

Interviewers may pivot to causal inference techniques (instrumental variables, propensity scores) for messy assignment, or to model monitoring topics (drift detection, feature-importance shifts) once you discuss serving and telemetry. Be ready to move from experiment design into operational ML concerns like feature-store reproducibility and deployment rollback mechanisms.
