
Plan and analyze a ranking A/B test

Last updated: Mar 29, 2026

Quick Overview

This question evaluates experimental-design and causal-inference competencies for online A/B testing, covering metric definition, randomization strategy under cross-session carryover and interference, power and sample-size calculations, sequential monitoring, heterogeneity analysis, and safe rollout and ML retraining considerations.



Company: Netflix

Role: Data Scientist

Category: Analytics & Experimentation

Difficulty: hard

Interview Round: Onsite



Related Interview Questions

  • Estimate ATE of personalization on streaming - Netflix (medium)
  • Compute ITT, TOT, and LATE with noncompliance - Netflix (medium)
  • Estimate ATE, ITT, and TOT from experiment - Netflix (easy)
  • Design experiment on culture memo emphasis - Netflix (medium)
  • Design and power a frequency-cap experiment - Netflix (hard)

Experiment Design: New Search Ranking Feature

Context

You are designing, running, and analyzing an online controlled experiment to evaluate a new search ranking feature for a consumer app with logged-in users. The feature may exhibit cross-session carryover (users learn or form habits) and could create network/interference effects (e.g., popularity feedback loops, shared caches, or ranking signals that influence others).

Tasks

  1. Unit of randomization
    • Choose among user-level, session-level, or query-level randomization.
    • Justify your choice given likely cross-session carryover and potential network/interference.
  2. Metrics
    • Define a primary success metric (e.g., query-level search success or downstream conversion within 24 hours), including precise measurement windows and inclusion criteria.
    • Define guardrail metrics (e.g., latency, crash rate, ads revenue if applicable, bounce rate) and how they’ll be monitored.
  3. Power and sample size
    • Baseline click-through rate (CTR) is 10%; you seek a relative +2% uplift (to 10.2%). Use a two-sided α = 0.05 and power = 0.8.
    • Show the standard two-proportion z-test sample-size formula and compute the required per-variant sample size (a worked computation follows this list).
    • Discuss how clustering (e.g., user-level correlation when randomizing users but analyzing queries) or CUPED would change the requirement (sketches for both follow this list).
  4. Execution plan
    • Outline SRM checks; triggered vs. intent-to-treat analyses; bucketing consistency across services; novelty effects/burn-in; and sequential monitoring without inflating Type I error (an SRM-check sketch follows this list).
  5. Heterogeneity
    • Pre-register segments (e.g., head vs. tail queries, country, device) and describe how you would test for treatment-by-segment interaction while controlling false discovery (an FDR sketch follows this list).
  6. Interference and long-term effects
    • If ranking changes affect supply/demand dynamics or popularity feedback, propose cluster-randomization or switchback testing and explain how to interpret the results (a switchback sketch follows this list).
  7. Rollout
    • Define stop/go criteria and a safe ramp plan.
    • Explain how to update ML training data post-experiment to avoid entangling model training with experimental exposure.
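
Worked Sketches

The sketches below are illustrative, not an official solution; variable names, sample data, and thresholds are assumptions unless they come directly from the prompt. Python is used throughout.

For task 3, the standard per-variant sample size for a two-proportion z-test is

  n = (z_(1-α/2)·sqrt(2·p̄·(1-p̄)) + z_(power)·sqrt(p1·(1-p1) + p2·(1-p2)))^2 / (p2 - p1)^2,  where p̄ = (p1 + p2)/2.

Plugging in the prompt's values (p1 = 0.10, p2 = 0.102, two-sided α = 0.05, power = 0.8):

  # Per-variant sample size for a two-proportion z-test.
  from scipy.stats import norm

  def per_variant_n(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
      z_a = norm.ppf(1 - alpha / 2)         # 1.96 for two-sided alpha = 0.05
      z_b = norm.ppf(power)                 # 0.8416 for power = 0.80
      pbar = (p1 + p2) / 2                  # pooled rate under the null
      num = (z_a * (2 * pbar * (1 - pbar)) ** 0.5
             + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
      return int(num / (p2 - p1) ** 2) + 1  # round up

  print(per_variant_n(0.10, 0.102))         # ~356,000 per variant

A 0.2-percentage-point absolute difference at a 10% baseline is expensive: roughly 356,000 exposed units per arm, which is why the variance-reduction discussion in the same task matters.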
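
For the CUPED part of task 3, a minimal sketch, assuming a pre-experiment covariate x (e.g., each user's pre-period CTR) is available for every unit; the simulated data is purely illustrative:

  # CUPED: reduce metric variance using a pre-experiment covariate.
  import numpy as np

  def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
      theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
      return y - theta * (x - x.mean())

  rng = np.random.default_rng(0)
  x = rng.normal(size=100_000)              # pre-period covariate
  y = 0.6 * x + rng.normal(size=100_000)    # in-experiment metric, correlated with x
  y_adj = cuped_adjust(y, x)
  print(y.var() / y_adj.var())              # variance ratio ~ 1 / (1 - rho^2)

CUPED shrinks metric variance by a factor of (1 - ρ²), so the required sample size shrinks by the same factor; ρ = 0.5 would cut the ~356k-per-arm figure above by 25%. Clustering pushes in the opposite direction: if you randomize by user but analyze queries, the effective sample size is divided by the design effect 1 + (m - 1)·ICC, where m is the average number of queries per user.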
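
For the SRM check in task 4, the canonical test is a chi-square goodness-of-fit on observed bucket counts against the designed allocation; the counts here are made up:

  # Sample-ratio-mismatch (SRM) check against a designed 50/50 split.
  from scipy.stats import chisquare

  observed = [503_112, 496_888]             # illustrative unit counts per arm
  expected = [sum(observed) / 2] * 2        # designed allocation
  stat, p = chisquare(observed, f_exp=expected)
  print(f"chi2={stat:.1f}, p={p:.1e}")      # p is far below 1e-3 here

A common convention is to halt at p < 1e-3: a failing SRM means the split itself is broken (logging loss, bot filtering, redirect bugs), so no downstream metric readout is trustworthy until bucketing is debugged.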
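
For task 5, one standard way to control false discovery across the pre-registered segments is Benjamini-Hochberg applied to the treatment-by-segment interaction p-values; the values below are hypothetical:

  # BH correction over pre-registered segment interaction tests.
  from statsmodels.stats.multitest import multipletests

  segments = ["head_queries", "tail_queries", "US", "intl", "mobile", "desktop"]
  p_values = [0.003, 0.41, 0.02, 0.76, 0.048, 0.09]   # hypothetical interaction p-values
  reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
  for seg, padj, r in zip(segments, p_adj, reject):
      print(f"{seg}: adjusted p = {padj:.3f}{' *' if r else ''}")

Only segments that survive BH at the pre-registered α are reported as credible interactions; everything else stays labeled exploratory.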
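
For task 6, a switchback design randomizes contiguous time windows rather than users, which contains interference within each window; the window length and experiment key below are assumptions:

  # Switchback assignment: hash the time window, not the user.
  import hashlib
  from datetime import datetime, timezone

  WINDOW_SECONDS = 30 * 60                  # assumed 30-minute windows
  KEY = "search-ranking-v2"                 # hypothetical experiment key

  def assignment(ts: datetime) -> str:
      window = int(ts.timestamp()) // WINDOW_SECONDS
      digest = hashlib.sha256(f"{KEY}:{window}".encode()).hexdigest()
      return "treatment" if int(digest, 16) % 2 == 0 else "control"

  print(assignment(datetime(2026, 3, 1, 14, 5, tzinfo=timezone.utc)))

Analysis then compares window-level means, treating windows (not queries) as the unit for variance estimation, and drops or explicitly models boundary windows where carryover straddles a switch. Because switchbacks estimate a short-horizon effect, they pair naturally with a small long-term holdout to catch slow-moving supply/demand shifts.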


