Prove friends outperform unconnected; design metrics, observational analysis, and rollout experiment
Company: Meta
Role: Data Scientist
Category: Analytics & Experimentation
Difficulty: hard
Interview Round: Technical Screen
##### Question
You are given two event tables, `info_stream_views` (one row per viewer–post view, with `viewer_id`, `post_id`, `relationship` ∈ {friend, unconnected}, `view_duration_ms`, `event_ts`, `ds`) and `post_reactions` (one row per reaction, with `reactor_id`, `post_id`, reaction type ∈ {like, comment, reshare, follow, hide, report}, `event_ts`, `ds`).
You hypothesize that **content authored by Friends is "more social" than content from Unconnected sources** (i.e., it drives more likes/comments/reshares per view). Using only these two tables, design a rigorous, end-to-end analysis: define metrics, validate the hypothesis observationally, design an experiment for launching/expanding Unconnected content, and quantify the value of Unconnected exposure even when its near-term engagement is lower.
1. **Define and justify metrics (formulas welcome).**
- Precisely define "more social" with measurable, denominator-complete metrics: e.g., social reactions per DAU, reactions per impression by relationship, reaction rate per view, comment/reshare rate per 100 (or 1,000) impressions, dwell-time lift (share of views with `view_duration_ms` ≥ 60,000, i.e., ≥ 60s), and same-day view-to-reaction conversion. Consider a weighted composite (e.g., weight comments/reshares above likes) and justify the weights.
- State the **unit of analysis** (viewer–post–day vs. post–day vs. impression), whether to include zero-reaction views, and how the choice affects bias.
- Specify **attribution**: join each reaction to its view on `reactor_id = viewer_id` and `post_id` (plus `ds`), attribute it to the `relationship` under which the viewer saw the post, collapse multiple views of the same post per viewer per day into one impression (e.g., keep the MAX duration), define a lookback window for reactions lacking a same-day matched view, and report the unattributed-reaction rate as a QA metric.
- Propose **normalization** (per impression, per unique viewer, per minute viewed) and **guardrail metrics**: daily active viewers, views/session, creator/topic diversity (unique authors per viewer–day, entropy), and quality guardrails (hide rate, report rate, negative share of reactions).
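The attribution and normalization rules above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the sample rows are made up, the reaction-type field is passed positionally because the question does not name that column, and only same-day matching is implemented (a lookback window would extend the lookup key).

```python
from collections import defaultdict

# Hypothetical sample rows; field order is an assumption for illustration.
views = [
    # (viewer_id, post_id, relationship, view_duration_ms, ds)
    (1, "p1", "friend", 4000, "2025-08-26"),
    (1, "p1", "friend", 9000, "2025-08-26"),  # repeat view, same day
    (1, "p2", "unconnected", 2000, "2025-08-26"),
    (2, "p1", "friend", 1000, "2025-08-26"),
    (2, "p3", "unconnected", 70000, "2025-08-26"),
]
reactions = [
    # (reactor_id, post_id, reaction type, ds)
    (1, "p1", "comment", "2025-08-26"),
    (2, "p1", "like", "2025-08-26"),
    (2, "p3", "like", "2025-08-26"),
    (1, "p2", "hide", "2025-08-26"),  # negative reaction, not counted as social
    (3, "p9", "like", "2025-08-26"),  # no matching view, so unattributed
]
SOCIAL = {"like", "comment", "reshare"}


def reaction_rates(views, reactions):
    """Social reactions per 1,000 impressions by relationship, plus a QA rate."""
    # One impression per (viewer, post, ds); keep the longest view's row.
    impressions = {}
    for viewer, post, rel, dur, ds in views:
        key = (viewer, post, ds)
        if key not in impressions or dur > impressions[key][1]:
            impressions[key] = (rel, dur)

    imps = defaultdict(int)
    for rel, _ in impressions.values():
        imps[rel] += 1  # zero-reaction impressions stay in the denominator

    reacts = defaultdict(int)
    social_total = unattributed = 0
    for reactor, post, rtype, ds in reactions:
        if rtype not in SOCIAL:
            continue
        social_total += 1
        hit = impressions.get((reactor, post, ds))  # reactor_id = viewer_id
        if hit is None:
            unattributed += 1
        else:
            reacts[hit[0]] += 1

    rates = {rel: 1000 * reacts[rel] / n for rel, n in imps.items()}
    qa = unattributed / social_total if social_total else 0.0
    return rates, qa


rates, unattributed_rate = reaction_rates(views, reactions)
print(rates, unattributed_rate)
```

Keeping zero-reaction impressions in the denominator is what makes the rate denominator-complete; dropping them would inflate both groups, and not necessarily by the same amount.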
2. **Observational validation.** Outline an analysis to compare Friend vs. Unconnected engagement while mitigating confounding over a fixed window (e.g., a 7-day window such as 2025-08-26..2025-09-01).
- List key confounders (viewer propensity to engage; author popularity; post age/freshness; content type, via proxies like view duration; time-of-day; device; rank position) and control for them.
- Propose a primary design — e.g., a **fixed-effects regression** with viewer×day fixed effects (and optionally author fixed effects) — and a secondary design (**propensity-score matching / inverse-propensity weighting**, or a **doubly-robust AIPW** estimator). State the unit of analysis, covariates, and outcome window (e.g., reaction within 24h of first view).
- Specify standard-error treatment (cluster by viewer and/or post), multiple-comparison control (one pre-specified primary metric; FDR on secondaries), and diagnostics (overlap/common support, post-adjustment covariate balance, placebo using hide/report outcomes, robustness across post-age buckets).
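As a rough illustration of the within-viewer idea behind the fixed-effects design, the sketch below computes each viewer's Friend-minus-Unconnected reaction-rate gap and averages the gaps across viewers. It is a simplified stand-in, not the regression itself: an OLS fixed-effects estimator weights viewers by their within-viewer exposure mix, whereas this version weights viewers equally and uses no covariates. The rows and the `reacted_within_24h` flag are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev

# Hypothetical impression-level rows: (viewer_id, relationship, reacted_within_24h)
rows = [
    (1, "friend", 1), (1, "friend", 0), (1, "unconnected", 0), (1, "unconnected", 0),
    (2, "friend", 1), (2, "unconnected", 1), (2, "unconnected", 0),
    (3, "friend", 0), (3, "friend", 1), (3, "unconnected", 1),
    (4, "friend", 1),  # no Unconnected views: contributes no within-viewer contrast
]


def within_viewer_gap(rows):
    """Mean per-viewer (friend - unconnected) reaction-rate gap and its SE."""
    by_viewer = defaultdict(lambda: {"friend": [], "unconnected": []})
    for viewer, rel, reacted in rows:
        by_viewer[viewer][rel].append(reacted)

    # Only viewers exposed to both relationship types identify the contrast.
    diffs = [
        mean(g["friend"]) - mean(g["unconnected"])
        for g in by_viewer.values()
        if g["friend"] and g["unconnected"]
    ]
    if len(diffs) < 2:
        raise ValueError("need at least two viewers with both exposure types")

    # One difference per viewer, so this SE is already at the viewer level.
    est = mean(diffs)
    se = stdev(diffs) / sqrt(len(diffs))
    return est, se, len(diffs)


gap, gap_se, n_viewers = within_viewer_gap(rows)
print(gap, gap_se, n_viewers)
```

Because each viewer serves as their own control, stable viewer traits such as overall propensity to react cancel out; post-level confounders (author popularity, post age, rank position) still need the covariates or the secondary design described above.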
3. **Experiment design — launching/expanding Unconnected content.** Propose a randomized experiment to measure success.
- Randomization unit (user-level, sticky); treatment variants: either reserve a share of feed slots for Unconnected content (e.g., 0% / 10% / 30%) or scale the Friend relationship ranking weight θ (e.g., θ = 1.0 control vs. θ = 0.8, which relatively upweights Unconnected). Define primary outcomes (net social reactions per DAU, reactions per session, time-to-first-friend-interaction), guardrails (retention/D+1 return, session length, hide/report rates, creator follows, friend-ecosystem health, long-term re-engagement), and minimal acceptable lifts.
- Provide a **power/duration check** (baseline rate, MDE, α, power, clustering design effect) and a **variance-reduction plan** (CUPED with pre-exposure per-user baseline; diff-in-diff for long-run panels).
- Address **novelty effects, personalization/learning ramp, supply constraints** (log intended vs. achieved Unconnected share; ITT + exposure-on-treated), and **network/peer interference** (graph-cluster randomization, supply/author holdouts, or interleaved time-split ramps).
- Include **segment / heterogeneity analyses** (new vs. power users; friend-graph density; region; consumption-style deciles) with multiple-testing control, and a clear ship/iterate decision framework.
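The power check and the CUPED adjustment mentioned above can be sketched as follows. Both functions are textbook approximations under stated assumptions: a two-sided, two-sample z-test on a per-user mean, and a single pre-exposure covariate. The standard deviation, MDE, and sample values are placeholders, not real baselines.

```python
from math import ceil
from statistics import NormalDist, mean, pvariance


def users_per_arm(sd, mde_abs, alpha=0.05, power=0.80, design_effect=1.0):
    """Users per arm for a two-sided, two-sample z-test on a difference in means.

    design_effect > 1 inflates the sample for clustered designs
    (e.g., graph-cluster randomization); it is 1.0 for plain user-level tests.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    n = 2 * (z_alpha + z_beta) ** 2 * sd ** 2 / mde_abs ** 2
    return ceil(n * design_effect)


def cuped_adjust(y, x):
    """Return y - theta * (x - mean(x)), with theta = cov(x, y) / var(x).

    x is each user's pre-exposure baseline of the same metric; in practice
    theta is estimated on the pooled arms so the adjustment is unbiased.
    """
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    theta = cov / var
    return [b - theta * (a - mx) for a, b in zip(x, y)]


# Placeholder inputs: sd of reactions per user-day = 4.0, absolute MDE = 0.04.
n_plain = users_per_arm(sd=4.0, mde_abs=0.04)
n_clustered = users_per_arm(sd=4.0, mde_abs=0.04, design_effect=1.2)

# Placeholder pre-period (x) and in-experiment (y) values for five users.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]
y_adj = cuped_adjust(y, x)

print(n_plain, n_clustered)
print(pvariance(y), pvariance(y_adj))
```

Dividing the required users per arm by the daily eligible traffic per arm gives a first-pass duration; the CUPED-adjusted variance can then be plugged back into `users_per_arm` (as a smaller `sd`) to see how much the baseline covariate shortens the test.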
4. **Value of Unconnected content beyond near-term engagement.** Even if immediate engagement is lower, define and measure the incremental value of Unconnected exposure using only the given tables (and call out what extra logs you'd request):
- **Discovery value**: new viewer–author pair rate, repeat-return-to-author rate, creator breadth and topical entropy per viewer–day.
- **Long-term value**: next-day (D+1) and 7-day retention, session depth.
- **Amplification**: reshare-driven downstream reach (incremental views following a reshare).
Present a trade-off view (per-1,000-impression KPIs) so a decision-maker can weigh near-term engagement against discovery and long-term value.
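One long-term KPI that is computable from `info_stream_views` alone is D+1 return by Unconnected exposure. The sketch below is descriptive only: the rows are made up, and the exposed/not-exposed split is confounded by who the ranker chooses to show Unconnected content to, so a causal reading needs the experiment. The author-based discovery metrics are not sketched because the view table as described has no author column; they would need a post-to-author mapping from an extra log.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical (viewer_id, relationship, ds) rows from info_stream_views.
views = [
    (1, "friend", "2025-08-26"), (1, "unconnected", "2025-08-26"),
    (1, "friend", "2025-08-27"),
    (2, "friend", "2025-08-26"),
    (3, "unconnected", "2025-08-26"), (3, "friend", "2025-08-27"),
    (4, "friend", "2025-08-26"), (4, "friend", "2025-08-27"),
]


def d1_return_by_exposure(views, cohort_ds):
    """Share of cohort-day viewers active the next day, split by exposure."""
    next_ds = (date.fromisoformat(cohort_ds) + timedelta(days=1)).isoformat()

    active = defaultdict(set)  # ds -> viewers with at least one view that day
    exposed = set()            # viewers who saw any Unconnected post on cohort_ds
    for viewer, rel, ds in views:
        active[ds].add(viewer)
        if ds == cohort_ds and rel == "unconnected":
            exposed.add(viewer)

    cohort = active[cohort_ds]
    groups = {"exposed": cohort & exposed, "not_exposed": cohort - exposed}
    return {
        label: (len(group & active[next_ds]) / len(group) if group else None)
        for label, group in groups.items()
    }


d1 = d1_return_by_exposure(views, "2025-08-26")
print(d1)
```

The same cohort structure extends to 7-day retention by checking activity on any of the following seven `ds` values, and the resulting rates can sit next to the per-1,000-impression engagement KPIs in the trade-off view.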
**Deliverables:** (a) a metric spec with formulas; (b) an observational analysis plan with controls and diagnostics; (c) an experiment-design doc with randomization unit, power inputs, interference mitigations, and stopping rules; (d) KPIs quantifying the incremental value of Unconnected content even when near-term engagement is lower.
Quick Answer: A Meta data-scientist technical-screen question on Analytics & Experimentation: using only info_stream_views and post_reactions, prove whether Friend-authored content is 'more social' than Unconnected content. It tests denominator-complete metric design with relationship attribution, observational causal validation (fixed effects + propensity/AIPW), and a network-aware rollout experiment with power, CUPED, interference mitigations, and guardrails — plus quantifying the long-term and discovery value of Unconnected content beyond near-term engagement.