This question evaluates a candidate's competency in experimental design, A/B testing, and statistical analysis for product decisions. It specifically tests metric selection, reliability and latency guardrails, stratified randomization, interference and clustering effects, heterogeneity detection, and operational stopping rules.

Context: You need to choose a default cap for group calls (maximum concurrent participants) among {4, 8, 16}. The decision must be made within 4 weeks and must protect call reliability and latency. Assume 8 is the current production default unless stated otherwise.
Specify an experiment that covers:
(a) Design choice: fixed multi-arm vs. bandit under a 4-week deadline. Justify exploration vs. exploitation trade-offs.
(b) Metrics: choose a primary business metric (e.g., successful starts per eligible user) and define reliability/latency guardrails (e.g., start success rate, join latency p95/p99, crash rate), including thresholds.
(c) Randomization: stratified randomization to balance device class, network type, and region. Explain how you will handle cluster/household effects and interference (participants from different arms joining the same call).
(d) Analysis: detect non-monotonic effects (8 could outperform both 4 and 16) and heterogeneity by segment (e.g., friends vs. workgroups). Outline an analysis that controls false discovery (e.g., hierarchical modeling or Holm–Bonferroni).
(e) Operations: stopping rules, interim looks, and rollback thresholds if tail latency SLOs are breached. Include how you will cap exposure to protect reliability while still enabling a decision in 4 weeks.
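
As a reference point for part (d), the Holm–Bonferroni step-down procedure can be sketched as follows. This is a minimal illustration, not a prescribed answer: the function name `holm_bonferroni`, the comparison labels, and the p-values are all hypothetical and chosen only to show the mechanics.

```python
from typing import Dict


def holm_bonferroni(p_values: Dict[str, float], alpha: float = 0.05) -> Dict[str, bool]:
    """Return {hypothesis: rejected?} under the Holm step-down procedure."""
    m = len(p_values)
    # Sort hypotheses from smallest to largest p-value.
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    rejected = {name: False for name in p_values}
    for rank, (name, p) in enumerate(ordered):
        # The k-th smallest p-value is compared against alpha / (m - k + 1);
        # rank is zero-based, so the divisor is m - rank.
        if p <= alpha / (m - rank):
            rejected[name] = True
        else:
            # Step-down rule: once one test fails, every larger p-value fails too.
            break
    return rejected


# Hypothetical p-values for the three pairwise arm comparisons (not real data).
example = {
    "cap4_vs_cap8": 0.004,
    "cap16_vs_cap8": 0.030,
    "cap4_vs_cap16": 0.040,
}
decisions = holm_bonferroni(example, alpha=0.05)
print(decisions)
```

With these illustrative inputs only `cap4_vs_cap8` is rejected: 0.004 clears the first threshold of 0.05/3, but 0.030 misses the second threshold of 0.05/2, so the procedure stops there and 0.040 is not rejected even though it is below 0.05 on its own.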