Design analysis to test social vs game engagement
Company: Meta
Role: Data Scientist
Category: Analytics & Experimentation
Difficulty: Medium
Interview Round: Onsite
##### Question
**Hypothesis:** Among Oculus (Meta Quest) users, those who use *social* features are more regularly engaged than those who use *game* features. Using activity data over an ~8-week window (e.g., 2025-07-01 to 2025-08-31), design a rigorous analysis plan to evaluate this claim. Answer all parts:
1. **Define the outcome.** Define "regularly engaged" precisely (e.g., ≥ 3 active days per week over the 8-week window, or ≥ 10 active days in 28 days). Choose a primary metric and 2–3 guardrail metrics. Justify each choice and propose an analysis-ready metric definition that is robust to outliers and seasonality.
2. **Ideal randomized experiment.** If you can randomize, specify the exact experiment: unit of randomization, treatment (e.g., an onboarding nudge that shifts first-week exposure toward social vs game), primary metric, guardrails, stratification variables, and how you will prevent contamination and novelty effects. State the power/MDE target at α = 0.05.
3. **Observational fallback (causal inference).** If randomization is infeasible, propose an observational design comparing *social-only* vs *game-only* users. Specify inclusion/exclusion criteria (minimum tenure, geography, device), null/alternative hypotheses, and a causal approach (propensity score matching/weighting, IPW, or exact matching on tenure buckets). List the covariates to control (signup date / tenure, baseline engagement, device, country, acquisition channel, content supply, weekday mix) and the diagnostics to validate overlap and balance.
4. **Power / sample size.** Suppose among game-only users the baseline share that is "regularly engaged" is p0 = 0.35. You want to detect a +3 percentage-point absolute lift (MDE = 0.03) with α = 0.05 (two-sided) and power 1 − β = 0.80. Compute the required sample size per group for a two-proportion Z-test and state all formulas and assumptions.
5. **Estimation and inference.** Describe the primary estimator and statistical test (e.g., difference in proportions with cluster-robust SEs if randomization is by user; a two-proportion z-test for the regular-week share; Welch's t-test or Mann–Whitney for mean weekly active days; or a logistic regression with covariates and robust SEs). Explain how you would handle multiple comparisons and interim looks.
6. **Bias and robustness checks.** Detail checks for bias and robustness: pre-trend checks, difference-in-differences on users who switch categories, placebo outcomes, sensitivity analysis for unobserved confounding (e.g., Rosenbaum bounds), and multiple-testing control for secondaries. List at least five concrete threats to validity (reverse causality — more engaged users self-select into social; category misclassification; taxonomy drift; bots / multi-accounts; seasonality; geographic shocks) and how you would detect/mitigate each.
7. **Decision-making and communication.** Define a clear decision rule combining statistical significance, a practical-significance threshold, and guardrails (e.g., p < 0.05 *and* lift ≥ X% with a CI excluding 0). Explain how you would communicate results, risks, and assumptions to product stakeholders, and what follow-up you would run if the effect is heterogeneous across tenure cohorts.
Quick Answer: A Meta (Oculus) data science onsite question on designing a rigorous study to test whether social-feature users are more regularly engaged than game-feature users. It covers metric definition, a randomized experiment, an observational causal-inference fallback (PSM/IPW), a two-proportion power calculation, estimation and inference, robustness/bias checks, and a decision rule.
Solution
**1) Defining "regularly engaged" and metrics**
Meta's Oculus / Quest is a VR hardware platform, so "engagement" means recurring headset sessions, not web visits. Define the outcome as a clear, time-windowed binary so it powers a two-proportion test:
- **Primary outcome:** `regularly_engaged` = user has ≥ 3 active days/week in ≥ 6 of the 8 weeks (or equivalently ≥ 10 active days in any 28-day sub-window). A binary, windowed definition is robust to single heavy-use days and is the cleanest thing to power.
- **Guardrails:** (a) median session length / total minutes (catches a case where "social" inflates day-counts with trivially short sessions), (b) 4-week retention / churn, (c) crash or comfort-related opt-outs (a VR-specific health/safety guardrail).
- **Robustness:** winsorize continuous metrics (e.g., minutes) at the 99th percentile; define active days in the user's local timezone; compare like-for-like calendar weeks to neutralize weekday/holiday seasonality; require a minimum tenure so brand-new users' onboarding spike doesn't dominate.
**2) Ideal randomized experiment**
- **Unit of randomization:** the user (account). User-level randomization avoids cross-user spillover and matches how exposure is delivered.
- **Treatment:** an onboarding / home-screen nudge that shifts first-week exposure toward social features (treatment) vs toward games (control) — this manipulates the *category* of early exposure, which is the lever we can actually pull. (You cannot randomize a user's pre-existing preference, so randomize the nudge, not the trait.)
- **Primary metric:** `regularly_engaged` measured in weeks 2–8 (post-exposure), to avoid mechanically counting the nudged sessions themselves.
- **Stratification:** device generation (Quest 2 vs 3 vs Pro), country/region, and baseline activity tier — stratified randomization tightens the estimate and guarantees balance on the strongest predictors.
- **Guardrails:** session length, retention, and comfort opt-outs as above.
- **Contamination / novelty:** randomize at the user level and keep a fixed assignment for the whole window to prevent cross-arm leakage; reserve a hold-out and inspect the time series so a short-lived novelty bump isn't mistaken for durable lift; pre-register the analysis to avoid peeking.
**3) Observational fallback — causal inference**
If you cannot randomize, compare social-only vs game-only users with explicit confounder control.
- **Inclusion/exclusion:** minimum tenure (e.g., headset ≥ 30 days before the window so onboarding is excluded), supported regions/locales only, exclude flagged bots / shared family accounts, restrict to active devices.
- **Hypotheses:** H0: P(regularly_engaged | social) = P(regularly_engaged | game); H1: P(social) > P(game) (one-sided if directional).
- **Causal method:** estimate a propensity for "chooses social" from pre-window covariates — signup_date / tenure, baseline activity in a pre-period, device generation, country, acquisition channel, content library size/supply, weekday mix — then **PSM or IPW**, optionally with **exact matching on tenure buckets**.
- **Diagnostics:** check **common support / overlap** (trim non-overlapping propensity regions), verify **covariate balance** after weighting/matching (standardized mean differences < 0.1), and report effective sample size after weighting.
**4) Power / sample size (two-proportion Z-test)**
For a two-sided two-proportion test with equal groups, p0 = 0.35, p1 = p0 + MDE = 0.38, α = 0.05, power = 0.80:
```
n_per_group = ( z_{1-α/2}·√(2·p̄(1-p̄)) + z_{1-β}·√(p0(1-p0)+p1(1-p1)) )² / (p1-p0)²
```
with z_{0.975} = 1.95996, z_{0.80} = 0.84162, and p̄ = (p0+p1)/2 = 0.365.
- p̄(1−p̄) = 0.365·0.635 = 0.231775 → √(2·0.231775) = √0.46355 ≈ 0.68085, times 1.95996 ≈ 1.33444.
- p0(1−p0) = 0.35·0.65 = 0.2275; p1(1−p1) = 0.38·0.62 = 0.2356; sum = 0.4631 → √0.4631 ≈ 0.68051, times 0.84162 ≈ 0.57273.
- Numerator = (1.33444 + 0.57273)² = (1.90717)² ≈ 3.63730.
- Denominator = (0.03)² = 0.0009.
- **n_per_group ≈ 3.63730 / 0.0009 ≈ 4,042**, i.e. ~4,050 per arm (~8,100 total).
Assumptions: independent users (one observation each), equal allocation, fixed effect size, no interim peeking; if randomizing by user but analyzing user-weeks, inflate by a design effect for clustering. A simpler pooled approximation, n ≈ (z_{1-α/2}+z_{1-β})²·2·p̄(1−p̄)/MDE², gives ~4,015 and is fine as a sanity check.
**5) Estimation and inference**
- **Primary estimator:** difference in the `regularly_engaged` proportions; test with a two-proportion z-test (or a logistic regression on the treatment indicator). With user-level randomization and one row per user, ordinary SEs suffice; if you analyze repeated user-weeks, use **cluster-robust SEs by user**.
- **Continuous secondaries** (mean weekly active days, minutes): **Welch's t-test** (unequal variances) or **Mann–Whitney** if heavily skewed.
- **Covariate adjustment** (observational or for precision): logistic regression with the propensity covariates and robust SEs, or IPW-weighted estimation.
- **Multiple comparisons:** control the family-wise error or FDR (Bonferroni / Benjamini–Hochberg) across guardrails and secondaries.
- **Interim looks:** use alpha-spending (O'Brien–Fleming / Pocock) or sequential testing; don't peek with a fixed α.
**6) Bias and robustness**
- **Pre-trend checks:** confirm social vs game cohorts had parallel engagement *before* the window (a divergence pre-treatment breaks the causal read).
- **Difference-in-differences on switchers:** for users who move between categories, compare before/after, differencing out fixed user traits.
- **Placebo outcomes:** test an outcome that should *not* be affected (e.g., account-settings visits); a "significant" placebo effect signals residual confounding.
- **Sensitivity analysis:** **Rosenbaum bounds** — how large an unobserved confounder would have to be to overturn the result.
- **Threats to validity (≥ 5):** (i) reverse causality — already-engaged users self-select into social, not the reverse; address with the randomized design or pre-period matching. (ii) category misclassification / taxonomy drift — audit the social/game labels and freeze the taxonomy for the window. (iii) bots / multi-accounts / family-shared headsets — filter via device fingerprint and activity heuristics. (iv) seasonality and content shocks (a hit game launch) — align calendar weeks, add time fixed effects, inspect by-week trends. (v) geographic / supply shocks — stratify by region and check robustness excluding affected markets. (vi) novelty effects — measure the durable post-exposure window, not the nudge week.
**7) Decision-making and communication**
- **Decision rule:** ship/conclude only if the primary effect is **both statistically significant (p < 0.05, CI excluding 0)** *and* **practically significant** (lift ≥ a pre-set threshold, e.g. +3 pp), **and** no guardrail regresses beyond its bound.
- **Communication:** lead with the estimated lift and its CI in plain terms, state the design (randomized vs observational) and its key assumptions, and be explicit about residual confounding risk if observational.
- **Heterogeneity follow-up:** if the effect varies by tenure cohort (e.g., strong for new users, flat for veterans), report the interaction, avoid over-generalizing the average, and propose a targeted follow-up experiment on the responsive segment.
Explanation
Rubric: a strong answer treats this as a causal question, not a correlational one. It (1) pins "regularly engaged" to a windowed binary that can be powered; (2) proposes randomizing the *exposure nudge* (you can't randomize a pre-existing preference) with user-level assignment and stratification; (3) falls back to PSM/IPW with named confounders, overlap and balance diagnostics; (4) correctly applies the two-proportion sample-size formula (~4,000–4,050/arm here); (5) picks tests that match the data type and controls multiplicity/interim looks; (6) defends against the dominant threat — reverse causality / self-selection — plus misclassification, bots, seasonality, with placebo + sensitivity checks; and (7) sets a decision rule requiring both statistical and practical significance with intact guardrails. Red flags: claiming causation from a raw social-vs-game comparison, ignoring self-selection, or no power math.