Design an experiment to estimate the impact of fake-account removal
Company: Meta
Role: Data Scientist
Category: Analytics & Experimentation
Difficulty: hard
Interview Round: Onsite
Your team will remove detected fake accounts and wants to estimate the causal impact on real users' experience. Design an end-to-end experiment plan that addresses network interference and product trade-offs.
Be specific:
1) Experiment unit and randomization: Choose and justify between user-level, ego-network cluster, or geography-level randomization. Describe how you would construct clusters to minimize cross-treatment contamination while maintaining power.
2) Primary and guardrail metrics: Specify exact definitions (e.g., comments_per_view, 7-day retention of real accounts, abuse reports per 1K views). Define metric windows and whether they are exposure- or calendar-based.
3) Power and duration: Provide a concrete back-of-envelope sample-size calculation assuming a 0.5% relative change in comments_per_view, baseline 0.12, overdispersion, and an intra-cluster correlation of 0.02.
4) Interference diagnostics: Propose two separate tests to quantify spillovers (e.g., ghost exposure analysis for users connected to treated removals; edge-cut A/A). Define the expected result under the null for each test.
5) Noncompliance and misclassification: Detection is imperfect. Outline an IV or CUPED/DID approach to recover the local average treatment effect (LATE) using removal intensity as an instrument; list the identifying assumptions and falsification checks.
6) Ramp and ethics: Define a staged rollout with kill-switch criteria using guardrails (e.g., creator reach drop >1% with p<0.05). Include how you will prevent label leakage in feeds and notifications.
7) Beyond experimentation: If randomization is infeasible in some markets, provide a quasi-experimental backup (synthetic control or staggered DID) and specify exact covariates required from logs. Conclude with a product recommendation you might make if engagement dips short-term but abuse reports drop materially.
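For point 3, the back-of-envelope power calculation can be sketched as follows. The overdispersion factor (phi = 2) and average cluster size (m = 200) are illustrative assumptions not given in the prompt; the 0.5% relative lift, 0.12 baseline, and 0.02 ICC come from the question.

```python
def cluster_sample_size(baseline, rel_change, phi, icc, m,
                        alpha=0.05, power=0.80):
    """Per-arm users needed to detect rel_change in an overdispersed
    rate under cluster randomization (normal approximation)."""
    delta = rel_change * baseline         # absolute effect size
    z = 1.96 + 0.84                       # z_{alpha/2} + z_{beta} for 5% / 80%
    var = phi * baseline                  # overdispersed Poisson-like variance
    n_iid = 2 * z**2 * var / delta**2     # per-arm n under user-level randomization
    deff = 1 + (m - 1) * icc              # design effect from clustering
    return n_iid * deff

# Illustrative: 0.5% relative lift on baseline 0.12, phi=2, ICC=0.02, m=200.
n = cluster_sample_size(0.12, 0.005, phi=2.0, icc=0.02, m=200)
print(f"~{n / 1e6:.0f}M users per arm")  # design effect here is ~5x
```

The takeaway a candidate should reach: a 0.5% relative effect on a noisy rate is tiny, and the design effect from clustering multiplies an already large user-level n roughly fivefold, which motivates variance reduction (CUPED) and careful cluster sizing.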
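For point 4, the edge-cut A/A diagnostic has a concrete expected null: with no treatment actually applied, a regression of a user's outcome on the fraction of their neighbors assigned to the other arm should yield a slope indistinguishable from zero. A minimal simulation of that null check (all data synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic A/A: arms assigned at random, no treatment actually applied.
cross_frac = rng.uniform(0, 1, n)   # stand-in for fraction of neighbors in the other arm
outcome = rng.normal(0.12, 0.3, n)  # e.g. comments_per_view, independent of assignment

# OLS slope of outcome on cross-arm exposure; expected ~0 under the null.
slope = np.cov(cross_frac, outcome)[0, 1] / np.var(cross_frac, ddof=1)
se = outcome.std() / (cross_frac.std() * np.sqrt(n))
print(f"slope={slope:.4f}, |t|={abs(slope) / se:.2f}")  # |t| should be unremarkable
```

Running the same regression on the real edge-cut A/A data and finding a significant slope would indicate contamination in the cluster construction before any treatment ships.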
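For point 5, the noncompliance logic can be illustrated with a Wald/2SLS estimator on synthetic data: assignment to an aggressive-removal arm is the instrument, actual removal exposure is the imperfectly detected treatment, and the ratio of the intent-to-treat effect to the first stage recovers the LATE. Every number below is simulated for illustration, not Meta data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
late_true = 0.05  # true effect of removal exposure on compliers (synthetic)

z = rng.integers(0, 2, n)            # instrument: assigned removal intensity (0/1)
complier = rng.random(n) < 0.7       # detection only catches ~70% of cases
d = z * complier                     # actual removal exposure (noncompliance)
y = 0.12 + late_true * d + rng.normal(0, 0.3, n)  # outcome, e.g. comments_per_view

itt = y[z == 1].mean() - y[z == 0].mean()          # intent-to-treat effect
first_stage = d[z == 1].mean() - d[z == 0].mean()  # compliance rate
late_hat = itt / first_stage                       # Wald estimator of the LATE
print(f"ITT={itt:.4f}, first stage={first_stage:.3f}, LATE={late_hat:.4f}")
```

The key assumptions to state in the interview map directly onto the code: the instrument moves exposure (strong first stage), affects outcomes only through removals (exclusion), and is randomly assigned; a falsification check is to run the same estimator on pre-period outcomes, where the estimate should be null.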
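For point 7, the synthetic-control backup can be sketched as a constrained regression: choose nonnegative donor-market weights so the weighted donors track the treated market's pre-period series, then project the post-period counterfactual. This toy uses SciPy's nonnegative least squares on synthetic series; a production version would enforce the sum-to-one constraint inside the optimizer and match on the covariates listed from logs.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(2)
t_pre, n_donors = 24, 6                    # 24 pre-period weeks, 6 donor markets

donors_pre = rng.normal(1.0, 0.2, (t_pre, n_donors))  # donor metric series (synthetic)
w_true = np.array([0.5, 0.3, 0.2, 0.0, 0.0, 0.0])     # treated is a convex combo
treated_pre = donors_pre @ w_true

w, resid = nnls(donors_pre, treated_pre)   # nonnegative weights fit on pre-period
w = w / w.sum()                            # renormalize toward the simplex
donors_next = rng.normal(1.0, 0.2, n_donors)          # next-week donor values
counterfactual_next = donors_next @ w      # projected treated outcome absent removal
print("weights:", np.round(w, 3), "pre-fit residual:", round(resid, 6))
```

The gap between the observed post-period series and `counterfactual_next` is the estimated effect; placebo runs on untreated markets give an informal significance check.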
Quick Answer: This question evaluates a data scientist's competency in experimental design and causal inference on networked social platforms. It covers unit and cluster randomization choices, metric selection and exposure windows, power estimation, interference diagnostics, handling of treatment misclassification, ethical ramping, and quasi-experimental backups.