Design and assess a video-pin increase experiment
Company: Pinterest
Role: Data Scientist
Category: Analytics & Experimentation
Difficulty: Medium
Interview Round: Technical Screen
##### Question
Pinterest wants to increase the share of video pins surfaced in the Home Feed (e.g., raising video share from a ~30% baseline toward a 45% target, an increase of roughly +10 to +15 percentage points) to boost engagement. Design a rigorous evaluation, then interpret provided results and make a recommendation.
**A) Experiment design**
1. **Unit of randomization & exposure control.** Specify the unit of randomization (user-level vs. session-level) and justify it given network/content-supply interference and feed-ranking spillovers. State how you would cap per-user exposure to the new mix (e.g., from a 30% baseline video share to a 45% target) and how you would ramp.
2. **Metrics.** State a clear primary hypothesis plus at least two alternative hypotheses (e.g., substitution effects, novelty effects, reliability cost). Define a primary success metric and guardrails with exact formulas (e.g., `saves_per_impression`, `clicks_per_impression`, `time_spent_per_user_day`, `complaint_rate = complaints / impressions`, `session_end_rate`, creator churn/follows, crash rate, bandwidth cost per user). Justify each and state win/loss directions.
3. **Power, duration & sequential analysis.** Outline your power / MDE and duration assumptions (alpha, two-sided test, allocation, variance source), and how you will handle sequential looks or peeking (e.g., group sequential boundaries or CUPED for variance reduction). Include an A/A check, a manipulation check (did video share actually move?), and a novelty/fatigue plan (minimum run and long-term holdout).
4. **Quasi-experiment.** If an RCT is infeasible, propose a credible quasi-experiment (e.g., staggered-rollout difference-in-differences with user fixed effects + inverse-propensity weighting, or synthetic control). List identifying assumptions, diagnostics, and sensitivity checks.
**B) Interpret the readout(s)** — assess statistical significance, directionality, and practical significance; call out red flags (SRM, multiple testing, heterogeneous effects, cross-unit metrics). Then recommend whether to **ship, ramp/iterate, or stop**, justify using the metrics, multiple-testing/guardrail considerations, and potential mitigations (e.g., cap video share for sensitive cohorts, rank-quality filters), and state what additional data or follow-up analysis you would run before a full rollout.
**Readout 1 — 14 days, user-level randomization, robust SEs (N ≈ 2.0M users)**
| metric | control | treatment | lift_% | p_value |
|---|---|---|---|---|
| CTR | 3.00% | 3.60% | +20.0 | 0.010 |
| avg_session_sec | 310 | 340 | +9.7 | 0.040 |
| 7d_retention | 28.0% | 27.0% | -3.6 | 0.070 |
| complaint_rate | 0.50% | 0.65% | +30.0 | 0.030 |
**Readout 2 — 14 days (alternate run, with absolute lifts and CIs)**
| metric | control | treatment | lift vs ctrl | p-value | 95% CI |
|---|---|---|---|---|---|
| CTR (clicks/impressions) | 4.70% | 4.85% | +0.15 pp | 0.040 | [+0.01, +0.29] pp |
| Saves per impression | 0.92% | 0.87% | -0.05 pp | 0.090 | [-0.11, +0.01] pp |
| Avg session time | 12.0 m | 12.1 m | +0.8% | 0.200 | [-0.3%, +1.9%] |
| Session crash rate | 1.20% | 1.32% | +0.12 pp | 0.010 | [+0.03, +0.21] pp |
| Exposed users | 200,300 | 199,700 | — | SRM p=0.62 | — |
In both cases: does the CTR win offset the negative guardrail (retention/complaints in Readout 1, crash rate in Readout 2)? If not, what mitigations or follow-ups would you require before full rollout?
Quick Answer: A Pinterest Data Scientist technical-screen question on Analytics & Experimentation: design a rigorous A/B test for increasing video-pin share in the Home Feed — unit of randomization, primary metric and guardrails, power/MDE, sequential analysis, and a quasi-experiment fallback — then interpret two 14-day readouts and recommend ship, iterate, or stop. It rewards balancing engagement lifts (CTR) against guardrail regressions (retention, complaints, crash rate), multiple-testing robustness, and cross-unit (per-impression vs per-session) trade-offs.
Solution
## A) Experiment design
### 1) Unit of randomization & exposure control
**Use user-level randomization**, not session-level. Pinterest's Home Feed is personalized and stateful across sessions, so a user randomized to "more video" should see a consistent treatment over time; this prevents within-user contamination (a user flipping between arms across sessions) and lets you measure cross-session outcomes like 7-day retention. Session-level randomization would bias retention/longitudinal metrics and create per-user mixed exposure.
**Interference / spillovers to control for:**
- **Content-supply / two-sided interference:** Showing more video to treatment users can shift creator behavior and the shared ranking model's training signals, leaking into control. Mitigate by holding the surfacing change to the serving layer for treatment users only, freezing model retraining during the test (or using a separate model), and considering cluster/geo-level randomization if supply effects are large.
- **Ranking spillovers:** If the feed reranks the same candidate pool, increasing video share for one cohort can change inventory available to others. Use a dedicated treatment-only allocation policy.
**Exposure capping & ramp:** Define the treatment as a target video share (e.g., 45%) applied as a per-session cap so a single feed never becomes all-video (protect against degenerate sessions). Ramp 1% → 5% → 20% → 50% of users, checking guardrails (crash rate, complaints, latency, bandwidth) at each stage with an automated kill switch.
### 2) Hypotheses, metrics & guardrails
**Primary hypothesis (H1):** Raising video share toward the target increases high-quality engagement (saves per impression and/or time spent) without harming reliability or trust.
**Alternative hypotheses:**
- **H2 (Substitution):** Video displaces high-intent static pins, raising clicks but lowering saves/impression and downstream value.
- **H3 (Novelty/fatigue):** Video is novel, temporarily lifting CTR; the effect decays — requires a long-term holdout to detect.
- **H4 (Reliability cost):** Heavier video load raises crash rate, jank, and bandwidth, offsetting engagement gains and harming retention/complaints.
**Primary success metric: saves per impression** = saves / impressions. On Pinterest, a save is a strong intent/quality signal tied to long-term value; CTR is clickbait-prone and time-spent can be inflated by autoplay. Per-impression normalization makes it robust to traffic-volume shifts. (`saves_per_user_day` is a reasonable alternative when you want a user-level rather than impression-level denominator; pre-register whichever you choose.)
**Secondary / diagnostic metrics:** CTR (clicks/impressions), time spent per user-day, video play starts and completion rate, creator follows.
**Guardrails (hard, must-pass), with directions:**
- `crash_rate` (crashed sessions / sessions) — must not rise beyond a pre-set bound (e.g., absolute Δ ≤ +0.05 pp or relative ≤ +5%). Lower is better.
- `complaint_rate = complaints / impressions` — must not rise. Lower is better.
- `7d_retention` (returning users / exposed users) — must not drop. Higher is better.
- `session_end_rate` / bounce — must not worsen. Lower is better.
- Bandwidth/serving cost per user — watch for cost regressions. Lower is better.
**Manipulation check:** verify treatment actually moved video share to the target (share of video impressions). If it didn't, the readout doesn't measure the intended change.
### 3) Power / MDE, duration, sequential analysis
- Fix **alpha = 0.05 (two-sided)**, **power = 0.80**, **50/50 allocation**. Derive the per-arm sample size from the variance of the primary metric (use historical data; reduce variance with **CUPED** on a pre-period covariate). Solve for the MDE you can detect given expected daily traffic, and pick a duration that (a) reaches that sample size and (b) covers at least one to two full weekly cycles to absorb weekday/weekend seasonality — commonly 14–28 days.
- **Peeking / sequential looks:** don't read p-values continuously at a fixed 0.05. Use a **group-sequential design (e.g., O'Brien–Fleming alpha-spending)** or always-valid mSPRT confidence sequences if you monitor daily, so early looks don't inflate Type-I error.
- **A/A test** before launch to validate the randomization and SE estimation (expect no significant differences).
- **Novelty/fatigue:** plan a minimum run and a **long-term holdout** (a small fraction held at baseline for weeks/months) to separate a durable lift from a novelty spike.
### 4) Quasi-experiment (if RCT infeasible)
Use a **staggered/phased rollout difference-in-differences** with user (and time) fixed effects, comparing not-yet-treated vs. treated cohorts; add **inverse-propensity weighting** to balance covariates, or a **synthetic control** at the market/geo level.
- **Identifying assumptions:** parallel trends (treated and control move together absent treatment), no anticipation, stable composition (SUTVA at the chosen unit).
- **Diagnostics:** event-study/pre-trend plots, placebo (in-time and in-space) tests, covariate-balance checks, and sensitivity to the comparison set. Prefer modern staggered-adoption estimators (e.g., Callaway–Sant'Anna) over naive two-way fixed effects, which can be biased with heterogeneous treatment timing.
## B) Interpreting the readouts
### Readout 1 (lift-% table, N ≈ 2.0M)
- **CTR +20.0% (p=0.010):** significant, large — a clear engagement-funnel lift.
- **avg_session_sec +9.7% (p=0.040):** significant, but possibly autoplay-inflated; corroborate with completion rate and saves.
- **7d_retention −3.6% (p=0.070):** *not* significant at 0.05, but the point estimate is negative — a serious early-warning sign for a longitudinal metric. Underpowered retention is common at 14 days; do not dismiss it.
- **complaint_rate +30.0% (p=0.030):** significant deterioration of a trust guardrail.
**Multiple testing:** four metrics inspected. With Holm/Benjamini–Hochberg control (or Bonferroni α ≈ 0.0125 for 4 tests), CTR (0.010) survives, session-time (0.040) and complaints (0.030) are borderline, and retention (0.070) does not reach significance. The CTR win is robust; the complaint regression is borderline-robust; the retention drop is suggestive but underpowered.
**Recommendation: do NOT ship as-is; iterate.** A strong CTR lift is undercut by a significant rise in complaints and a negative (if not-yet-significant) retention trend — exactly the substitution/quality-degradation pattern (H2). Shipping a CTR win that costs trust and possibly retention is a bad trade. Mitigations and follow-ups below.
### Readout 2 (absolute lifts + CIs)
- **CTR +0.15 pp (p=0.040, CI [+0.01,+0.29]):** significant but small (~+3.2% relative).
- **Saves/impression −0.05 pp (p=0.090, CI crosses 0):** not significant; negative trend (~−5.4% relative) — the primary quality metric is going the wrong way.
- **Avg session time +0.8% (p=0.200):** not significant.
- **Crash rate +0.12 pp (p=0.010, CI [+0.03,+0.21]):** significant deterioration (~+10% relative) — a hard reliability guardrail breached.
- **SRM p=0.62:** no sample-ratio mismatch (good; the split is trustworthy).
**Multiple testing (Bonferroni α ≈ 0.01 for ~5 metrics):** CTR (0.040) would *not* survive correction; crash (0.010) sits at the threshold and stays concerning. So the CTR win is fragile under multiplicity while the reliability regression is robust.
**Cross-unit caution:** CTR and saves are per-*impression* while crash rate is per-*session*; you cannot net them directly. Frame the trade-off as expected value per session: `EV ≈ w_click·clicks + w_save·saves + w_time·minutes − w_crash·crashes − w_complaint·complaints`. The crash weight is large (crashes drive immediate abandonment, app-store rating damage, and churn). With saves trending negative and only a fragile CTR gain, the small CTR uptick does **not** offset a robust +10% relative crash increase.
**Recommendation: do NOT ship; stop the current implementation and fix reliability, then rerun.** A reliability guardrail failure supersedes a small, non-robust engagement gain.
### Required mitigations & follow-ups (both readouts)
- **Reliability (Readout 2):** triage top crash signatures (OOM during decode, player lifecycle, GPU surfaces); lower-bitrate/shorter previews and deferred off-screen autoplay on older/low-memory devices; cap concurrent decodes; device/OS gating; real-time crash monitoring + kill switch.
- **Quality / trust (both):** rank-quality filters to surface high-quality video and exclude clickbaity/low-quality video that drives clicks without saves; cap video share for sensitive cohorts; investigate the complaint drivers (autoplay sound? irrelevant video?).
- **Measurement:** confirm the manipulation check (video share actually moved); pre-register the primary metric and guardrail thresholds; extend to 28 days + long-term holdout to test novelty decay and let retention reach power.
- **Dose-finding:** multi-cell test (0 / +5pp / +10pp / +20pp) to find the share that maximizes saves while keeping crash, complaints, and retention within guardrails.
- **Heterogeneity & downstream:** slice by device/OS/network, new vs returning, heavy- vs light-video users, content vertical, geo; measure day-1/day-7 retention and saves-to-return attribution for video vs static.
- **Stats hygiene:** use a pre-specified analysis plan and Holm/BH correction for secondaries; respect the sequential-testing boundaries.
**Bottom line:** In both readouts a CTR lift is paired with a guardrail regression (complaints/retention in Readout 1; crash rate in Readout 2) and no convincing high-quality-engagement gain. Do not ship: iterate on quality and reliability, add the missing checks, run a dose-finding test with a longer horizon, and re-evaluate against the pre-registered primary metric and guardrails.
Explanation
Rubric: (1) chooses user-level randomization and reasons about two-sided/content-supply interference and ranking spillovers; (2) picks a quality-oriented primary metric (saves/impression) over CTR with hard reliability/trust guardrails and exact formulas; (3) covers power/MDE, CUPED, sequential-look correction, A/A, manipulation check, and a novelty long-term holdout; (4) offers a credible quasi-experiment with stated identifying assumptions and diagnostics; (B) reads each table for significance, direction, practical magnitude, multiple-testing robustness, SRM, and cross-unit (per-impression vs per-session) caution, then recommends NOT shipping because a guardrail regression outweighs a small/fragile CTR gain, with concrete mitigations and follow-ups. A strong answer resists shipping on the CTR win alone and treats the negative retention/complaint/crash signals as decision-controlling.