Communicate and de-risk a non-experimental launch
Company: Reddit
Role: Data Scientist
Category: Behavioral & Leadership
Difficulty: hard
Interview Round: Technical Screen
You’ve estimated a positive impact with Synthetic Control and must recommend whether to launch. How do you communicate evidence strength, residual risks, and key assumptions to executives and skeptical partners? Outline a staged rollout with kill-switches, guardrail SLAs, and ownership for real-time monitoring. Define explicit rollback criteria and a pre-committed decision memo structure (problem, stakes, method, diagnostics, results, sensitivity, risks, decision thresholds, contingency plan, sign-offs). Describe how you’ll handle dissent (e.g., legal, infra), prevent p-hacking/confirmation bias, and structure a post-launch review that could reverse the decision if guardrails regress.
Quick Answer: This question evaluates a data scientist's competency in interpreting causal-inference results from a Synthetic Control, communicating evidence strength and residual risks to executives and skeptical partners, structuring rollout governance and rollback criteria, and managing bias control, dissent and cross-functional accountability.
Solution
# Executive Summary Approach
- Deliver a one-page executive readout with a traffic-light recommendation (green/yellow/red) and a pre-committed decision rule.
- Pair it with an appendix for skeptics: full diagnostics, sensitivity analyses, and assumptions clearly stated.
- State what would change your mind (falsification tests, guardrail breaches) and by when.
# Communicating Evidence Strength
1) Plain-language takeaway
- Example: "The change increased daily sessions by about +2.1% (95% interval: +0.8% to +3.4%) with no detectable impact on error rate or latency."
2) Core Synthetic Control concepts (brief)
- Synthetic control constructs a weighted combination of control units to match the treated unit’s pre-treatment trajectory.
- Estimate: effect_t = Y_treated,t − Σ_j w_j Y_control_j,t for t in post-period.
- We typically average effects over the post-period, and quantify uncertainty via in-space placebo/permutation tests.
3) Critical diagnostics (show visuals in appendix; summarize in exec page)
- Pre-treatment fit: RMSPE_pre is small relative to the metric's typical level; show a gap plot centered on zero before treatment.
- Placebo/permutation inference: treated effect vs. distribution of placebo effects; report p-value and rank.
- RMSPE ratio: RMSPE_post / RMSPE_pre; a treated-unit ratio that is large relative to the placebo units' ratios indicates the effect is specific to the treated unit.
- Donor pool sanity: weights non-pathological; no single donor dominates unless justified; leave-one-out sensitivity stable.
- In-time placebo: moving the treatment date to an earlier point in the pre-period should yield a null effect.
- Robustness: augmented synthetic control, alternative donor pools, excluding contemporaneous shocks.
4) Quantify uncertainty and practical significance
- Report effect size with interval and decision-relevant translation (e.g., revenue/day, DAU, cost savings).
- Pre-commit a minimum effect worth shipping (e.g., +1.0% sessions net of risks).
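The RMSPE-ratio and placebo diagnostics above can be sketched in a few lines. This is a minimal illustration, not a full synthetic control fit: it assumes the gap series (actual minus synthetic) have already been computed for the treated unit and for each placebo unit, and the function names and sample numbers are hypothetical.

```python
import math

def rmspe(gaps):
    """Root mean squared prediction error of a list of gaps (actual minus synthetic)."""
    return math.sqrt(sum(g * g for g in gaps) / len(gaps))

def rmspe_ratio(gaps, n_pre):
    """Post/pre RMSPE ratio for one unit; the first n_pre gaps are pre-treatment."""
    return rmspe(gaps[n_pre:]) / rmspe(gaps[:n_pre])

def placebo_p_value(treated_gaps, placebo_gaps, n_pre):
    """Share of all units (treated included) whose ratio is at least the treated ratio."""
    treated = rmspe_ratio(treated_gaps, n_pre)
    ratios = [rmspe_ratio(g, n_pre) for g in placebo_gaps]
    rank = 1 + sum(r >= treated for r in ratios)
    return rank / (len(ratios) + 1)

# Hypothetical gap series: 4 pre-treatment periods, then 3 post-treatment periods.
treated = [0.1, -0.1, 0.1, -0.1, 2.0, 2.2, 1.9]
placebos = [
    [0.2, -0.2, 0.2, -0.2, 0.3, -0.1, 0.2],
    [0.1, 0.1, -0.1, -0.1, 0.1, 0.2, -0.1],
    [0.3, -0.3, 0.3, -0.3, -0.4, 0.3, 0.2],
]
p = placebo_p_value(treated, placebos, n_pre=4)
print(f"treated ratio = {rmspe_ratio(treated, 4):.1f}, permutation p = {p}")
```

With only three placebos the smallest achievable p-value is 1/4, which is one reason to report the donor pool size next to the p-value in the executive readout.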
# Residual Risks and Key Assumptions
- Assumptions
  - Convex hull coverage: treated unit’s pre-period can be approximated by donors.
  - No unobserved confounder that changes exactly at treatment and uniquely affects the treated unit.
  - Stable data-generating process across pre/post; limited spillovers between treated and donor units (SUTVA).
- Residual risks
  - Interference/spillovers (e.g., cross-region traffic migration).
  - Concurrent shocks (marketing, outages, policy changes).
  - Non-stationarity/seasonality; novelty effects that decay.
  - Measurement issues: logging changes, bot traffic, metric drift.
  - Generalizability: treated cohort vs. global population differences.
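One cheap screen for the convex hull assumption is to check, period by period, whether the treated unit's pre-treatment value falls inside the donors' min/max range. This is a necessary condition only: passing it does not prove that a good convex combination exists, but failing it means none can match that period exactly. The function name and data below are hypothetical.

```python
def outside_donor_range(treated_pre, donors_pre):
    """Return indices of pre-periods where the treated value lies outside the donors' min/max."""
    flagged = []
    for t, y in enumerate(treated_pre):
        donor_values = [d[t] for d in donors_pre]
        if not (min(donor_values) <= y <= max(donor_values)):
            flagged.append(t)
    return flagged

# Hypothetical pre-period series for one treated unit and two donors.
treated_pre = [10.0, 11.0, 12.5, 15.0]
donors_pre = [
    [9.0, 10.0, 11.0, 12.0],
    [11.0, 12.0, 13.0, 14.0],
]
flagged = outside_donor_range(treated_pre, donors_pre)
print("periods outside donor range:", flagged)
```

Flagged periods are worth calling out in the appendix, since synthetic control would have to extrapolate there.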
# Staged Rollout Plan with Guardrails and Kill-Switches
1) Phased rollout
- Phase 0: Internal/opt-in (dogfood) for 2–3 days to validate logging and UX.
- Phase 1: 1% random traffic for 48–72 hours; focus on stability and core guardrails.
- Phase 2: 10% for 3–7 days; run near real-time monitoring and synthetic control or diff-in-diff on staggered enablement.
- Phase 3: 50% for 1–2 weeks; confirm persistence and heterogeneity by platform/region.
- Phase 4: 100% if thresholds met for two consecutive review cycles.
2) Guardrail SLAs (examples; tailor to product)
- Reliability: error rate ≤ baseline + 0.05 pp; crash rate ≤ baseline + 5% (relative).
- Performance: p95 latency ≤ baseline + 10% or baseline + 50 ms, whichever margin is smaller (the stricter bound).
- Engagement/health: session depth ≥ baseline − 0.5%; retention D1/D7 no more than 0.25 pp below baseline; content/abuse reports no more than 2% above baseline.
- Revenue/monetization: RPM ≥ baseline − 0.5% unless uplift elsewhere compensates per pre-committed tradeoff rule.
3) Kill-switches
- Feature flags with instant rollback; configs per platform/region.
- Auto-disable if any critical guardrail crosses threshold for N consecutive minutes (e.g., 15–30) to reduce false positives.
4) Ownership and real-time monitoring
- RACI
  - DRI (PM): decision calls and stakeholder comms.
  - Data Science: causal measurement, guardrail design, analysis, daily updates.
  - Eng/SRE: alerts, on-call runbook, rollout/rollback execution.
  - Legal/Trust & Safety/Infra: sign-offs and policy/compliance checks.
- Monitoring setup
  - Single live dashboard with primary/secondary metrics, segmented by platform/geo.
  - Alerting: paging thresholds and burn-rate alerts; Slack/incident channel with on-call rotation.
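The latency guardrail and the consecutive-minutes kill-switch rule can be sketched as follows. This is a simplified illustration under stated assumptions: the metric arrives as one value per minute, a breach means exceeding the limit, and the function names are hypothetical rather than part of any real feature-flag system.

```python
def latency_limit(baseline_ms, pct=0.10, abs_ms=50.0):
    """Allowed p95 latency: baseline plus the smaller of a relative and an absolute margin."""
    return baseline_ms + min(baseline_ms * pct, abs_ms)

def should_auto_disable(per_minute_values, limit, n_consecutive):
    """True if the metric exceeds its limit for n_consecutive minutes in a row."""
    streak = 0
    for value in per_minute_values:
        streak = streak + 1 if value > limit else 0
        if streak >= n_consecutive:
            return True
    return False

# Hypothetical baseline of 400 ms: the 10% margin (40 ms) is stricter than 50 ms.
limit = latency_limit(400.0)

# 10 breaching minutes, one healthy minute, then 14 more breaching minutes:
# the streak resets, so a 15-minute rule does not fire.
interrupted = [445.0] * 10 + [430.0] + [450.0] * 14
sustained = [450.0] * 15

print(should_auto_disable(interrupted, limit, n_consecutive=15))
print(should_auto_disable(sustained, limit, n_consecutive=15))
```

Requiring a consecutive streak trades detection speed for fewer false alarms; a real deployment would likely also want the repeat-breach rule so that flapping metrics like the interrupted series above still get escalated.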
# Explicit Rollback Criteria
- Immediate rollback if any critical guardrail breach persists beyond the auto-disable window or repeats ≥2 times in 24h.
- Programmatic rollback if:
  - Primary KPI uplift < pre-committed minimum effect for two review windows (e.g., 72h rolling) and no offsetting benefits per the tradeoff rule.
  - Safety/quality metrics degrade beyond thresholds after controlling for exogenous events.
- Manual override possible only with a signed exception by PM+Eng+Legal after documented risk assessment.
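Writing the rollback criteria as a function before launch makes them harder to reinterpret afterwards. The sketch below covers only two of the criteria (repeated critical breaches within 24h, and a KPI shortfall over two consecutive review windows); the function name, arguments, and return labels are hypothetical.

```python
def rollback_decision(breach_hours, uplift_by_window, min_effect,
                      has_offsetting_benefit=False):
    """Return 'immediate', 'programmatic', or 'none'.

    breach_hours: hours (within the last 24) at which a critical guardrail breach fired.
    uplift_by_window: primary KPI uplift per review window, oldest first.
    min_effect: pre-committed minimum effect worth shipping.
    """
    # Two or more critical breaches in 24h triggers an immediate rollback.
    if len(breach_hours) >= 2:
        return "immediate"
    # Uplift below the minimum in the two most recent windows, with nothing offsetting it.
    last_two = uplift_by_window[-2:]
    shortfall = len(last_two) == 2 and all(u < min_effect for u in last_two)
    if shortfall and not has_offsetting_benefit:
        return "programmatic"
    return "none"

# Hypothetical inputs with a +1.0% minimum effect.
print(rollback_decision([3.0, 17.5], [0.020, 0.020], min_effect=0.01))
print(rollback_decision([], [0.012, 0.006, 0.004], min_effect=0.01))
print(rollback_decision([5.0], [0.004, 0.015], min_effect=0.01))
```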
# Pre-Committed Decision Memo Structure
- Problem: What decision and why now; link to strategy.
- Stakes: Business impact range (best/base/worst), user risk, cost of delay.
- Method: Why synthetic control; unit of analysis, donor pool, treatment date.
- Diagnostics: Pre-fit quality, placebos, RMSPE ratios, LOO tests, robustness variants.
- Results: Point estimates, intervals, practical translations; segment heterogeneity.
- Sensitivity: Alternative specs, donor exclusions, augmented SC, in-time placebos.
- Risks: Assumptions, interference, measurement, operational risks.
- Decision thresholds: Ship if uplift ≥ X% and guardrails within Y; otherwise hold.
- Contingency plan: Rollback playbook, comms, engineering steps, re-run plan.
- Sign-offs: Names/titles/date; dissent documented if applicable.
- Appendices: Plots, code refs, data QA, event logs.
# Handling Dissent
- Pre-mortem session: identify failure modes, log mitigations.
- Red-team review: assign a skeptic to challenge assumptions and donor pool choices.
- Minority report: dissenting stakeholders append a written perspective to the memo; decision-maker acknowledges in writing.
- Escalation path: clear timeline for raising legal/infra concerns; block launch only for specified classes of risk (privacy, security, compliance, SLO breach).
# Preventing P-Hacking and Confirmation Bias
- Pre-registration/pre-analysis plan
  - Primary and secondary metrics; analysis window; exclusion rules; minimum sample duration.
  - Donor pool and tuning choices fixed before peeking.
  - Holdout cohorts or time blocks reserved for confirmation.
- Multiple testing control for secondaries (e.g., Benjamini-Hochberg) and clear separation between confirmatory vs. exploratory analyses.
- Sequential analysis with alpha spending or group-sequential boundaries to avoid repeated-peeking bias.
- Dashboard hygiene: freeze definitions; version metrics; document any post-hoc changes.
- Independent replication by a second analyst for code/data QA.
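For the secondary metrics, the Benjamini-Hochberg step-up procedure is short enough to write out. The sketch below is a plain-Python version for illustration; the p-values are hypothetical, and in practice a vetted statistics library would be the safer choice.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k such that p_(k) <= (k / m) * q.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[i] = True
    return rejected

# Hypothetical p-values for four secondary metrics.
p_secondary = [0.001, 0.04, 0.03, 0.20]
print(benjamini_hochberg(p_secondary))
```

Here only the first metric survives correction even though three of the four raw p-values are below 0.05, which is the kind of result worth showing skeptics.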
# Post-Launch Review and Possible Reversal
- Cadence: 24h, 72h, and weekly reviews for 4 weeks; then monthly.
- Methods: Continue synthetic control or switch to staggered diff-in-diff as more cohorts roll in; monitor change-points (Shewhart/CUSUM) for guardrails.
- Reversal triggers
  - Any critical guardrail crosses threshold for two consecutive weekly windows.
  - Degradation trends with statistically credible change-point and practical significance.
  - New information (e.g., policy or legal risk) invalidates assumptions.
- If triggered
  - Execute rollback runbook within target time (e.g., ≤30 minutes).
  - Open incident with blameless postmortem; identify root cause; define remediation and re-validation plan.
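A one-sided CUSUM is one way to catch a sustained guardrail drift that never trips a single-point threshold. The sketch below is a minimal version under assumed parameters: `target` is the pre-launch baseline, `slack` is the drift tolerated per observation, and `threshold` is the alarm level. All values and names are hypothetical and would need tuning against historical data.

```python
def cusum_upper(values, target, slack, threshold):
    """One-sided upper CUSUM; return the first index where the statistic exceeds threshold, else None."""
    s = 0.0
    for i, x in enumerate(values):
        # Accumulate only deviations above target + slack; never go below zero.
        s = max(0.0, s + (x - target - slack))
        if s > threshold:
            return i
    return None

# Hypothetical daily error rate in percentage points; drift starts at index 4.
error_rate_pp = [0.50, 0.52, 0.49, 0.51, 0.58, 0.60, 0.61, 0.59]
alarm = cusum_upper(error_rate_pp, target=0.50, slack=0.02, threshold=0.15)
print("first alarm index:", alarm)
```

A Shewhart-style check would flag a single large excursion faster, while CUSUM is better suited to small persistent shifts, so the two are often run side by side.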
# Small Numerical Example (for intuition)
- Pre-period RMSPE = 0.9 units; post-period average effect = +2.1% sessions.
- Placebo test: the treated effect ranks 5th largest among the treated unit plus 100 placebos → permutation p = 5/101 ≈ 0.05.
- 95% interval via placebo distribution: [+0.8%, +3.4%].
- Decision threshold: ship if uplift ≥ +1.0% and all critical guardrails within SLA. Result meets both → recommendation = green to proceed to Phase 1 with kill-switches.
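The example's arithmetic and decision rule can be checked with a short script. The green condition follows the threshold stated above; the mapping of the other two outcomes to yellow and red is an assumption for illustration, as are the function names.

```python
def permutation_p(rank, n_placebos):
    """Permutation p-value when the treated unit is ranked among itself plus n_placebos."""
    return rank / (n_placebos + 1)

def recommend(uplift, min_effect, guardrails_ok):
    """Traffic-light recommendation from a pre-committed rule.

    green: uplift meets the minimum effect and all critical guardrails are within SLA.
    red (assumed mapping): a critical guardrail is outside SLA.
    yellow (assumed mapping): guardrails are fine but the uplift falls short.
    """
    if not guardrails_ok:
        return "red"
    return "green" if uplift >= min_effect else "yellow"

p = permutation_p(rank=5, n_placebos=100)
decision = recommend(uplift=0.021, min_effect=0.010, guardrails_ok=True)
print(f"p = {p:.4f}, recommendation = {decision}")
```

A stricter variant would compare the lower end of the interval (+0.8%) with the threshold instead of the point estimate; under that rule this example would not reach green, so the choice between the two belongs in the pre-committed memo.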
# Common Pitfalls and Guardrails
- Overfitting pre-period: use augmented SC or penalization; validate with in-time placebo.
- Donor contamination: exclude geos/platforms with potential spillovers.
- Concurrent changes: freeze other experiments in treated and high-weight donor units; maintain an event log.
- Non-stationarity: ensure sufficient pre-period length; include seasonality controls; extend monitoring horizon.
This plan makes assumptions and decision rules explicit up front, reduces incentives to data-dredge, creates operational guardrails with ownership, and establishes conditions under which the decision will be reversed if reality diverges from the initial estimate.