Design bot detection and evaluate trade-offs
Company: Meta
Role: Data Scientist
Category: Machine Learning
Difficulty: hard
Interview Round: Onsite
You must design and evaluate a bot-detection system for comment activity. Address:
1) Labeling strategy with minimal ground truth: propose weak-supervision heuristics, manual review sampling plans, and how you’d de-bias labels given extreme class imbalance (e.g., <0.5% bots).
2) Features across time scales: per-session burstiness, inter-comment intervals, entropy of targets, language signals, graph-based reciprocity; specify which must be real-time vs. batch.
3) Model choice and calibration: compare linear models, tree ensembles, and sequence models; explain how to calibrate posteriors (Platt scaling or isotonic regression) and monitor calibration drift.
4) Thresholding by cost: define costs for FP (blocking a human) vs FN (missing a bot) and pick an operating point using precision-recall curves; compute expected blocked-human-minutes at a chosen threshold given example rates.
5) Adversarial robustness: features least gameable, canaries, and drift detection.
6) Online safety net: shadow mode, backfill re-scoring, human review queues.
7) Evaluation: offline PR-AUC/recall@high-precision; online guardrails (reports, creator retention, comment latency).
8) If FP becomes high, trace the root cause with error analysis and propose a rollback/ramp strategy.
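For item 4, the arithmetic behind "expected blocked-human-minutes" can be sketched in a few lines. This is a minimal illustration, not a reference answer: the operating points, daily volume, the 0.5% bot rate (taken as the upper bound the prompt mentions), the 30 minutes lost per wrongly blocked human, and the relative FP/FN costs are all assumed values, and the names `OperatingPoint` and `expected_outcomes` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OperatingPoint:
    threshold: float
    precision: float  # share of blocked accounts that really are bots
    recall: float     # share of all bots that get blocked


def expected_outcomes(op, daily_accounts, bot_rate, minutes_per_fp,
                      cost_per_fp, cost_per_fn):
    """Translate one PR-curve point into daily counts and an expected cost."""
    bots = daily_accounts * bot_rate
    true_pos = bots * op.recall
    # precision = TP / (TP + FP)  =>  FP = TP * (1 - precision) / precision
    false_pos = true_pos * (1 - op.precision) / op.precision
    false_neg = bots - true_pos
    return {
        "threshold": op.threshold,
        "blocked_humans": false_pos,
        "blocked_human_minutes": false_pos * minutes_per_fp,
        "missed_bots": false_neg,
        "total_cost": false_pos * cost_per_fp + false_neg * cost_per_fn,
    }


# All numbers below are illustrative assumptions, not measured rates.
candidates = [
    OperatingPoint(threshold=0.5, precision=0.80, recall=0.80),
    OperatingPoint(threshold=0.7, precision=0.96, recall=0.60),
    OperatingPoint(threshold=0.9, precision=0.99, recall=0.30),
]
results = [
    expected_outcomes(op, daily_accounts=1_000_000, bot_rate=0.005,
                      minutes_per_fp=30, cost_per_fp=10.0, cost_per_fn=1.0)
    for op in candidates
]
# Pick the operating point with the lowest expected daily cost.
best = min(results, key=lambda r: r["total_cost"])
print(best["threshold"], round(best["blocked_human_minutes"]))
```

Under these assumed inputs the 0.7 threshold has the lowest expected cost: roughly 125 humans blocked per day, or about 3,750 blocked-human-minutes. Changing the FP-to-FN cost ratio shifts the chosen point along the precision-recall curve, which is the trade-off the question asks candidates to reason about.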
Quick Answer: This question evaluates a data scientist's competency in designing operational machine-learning systems for bot detection, covering labeling and weak supervision, temporal and graph-based feature engineering, model choice and calibration, cost-based thresholding, adversarial robustness, and online monitoring and safety nets.
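For the label de-biasing part of item 1, one common approach (an assumption here, not something the prompt prescribes) is to oversample heuristic-flagged accounts for manual review and then reweight each reviewed account by the inverse of its sampling probability. The sketch below uses made-up stratum sizes and review counts, and the function name `estimate_bot_prevalence` is hypothetical.

```python
def estimate_bot_prevalence(strata):
    """Inverse-probability-weighted estimate of the overall bot rate.

    Each stratum is a dict with:
      population: accounts in the stratum
      reviewed:   accounts sampled uniformly at random for manual review
      bots_found: reviewed accounts labeled as bots
    """
    total_population = sum(s["population"] for s in strata)
    estimated_bots = 0.0
    for s in strata:
        # Each reviewed account stands in for population / reviewed accounts.
        weight = s["population"] / s["reviewed"]
        estimated_bots += weight * s["bots_found"]
    return estimated_bots / total_population


# Illustrative numbers only: flagged accounts are heavily oversampled.
strata = [
    {"population": 10_000, "reviewed": 500, "bots_found": 200},   # heuristic-flagged
    {"population": 990_000, "reviewed": 1_000, "bots_found": 1},  # unflagged
]

reviewed_total = sum(s["reviewed"] for s in strata)
naive_rate = sum(s["bots_found"] for s in strata) / reviewed_total
weighted_rate = estimate_bot_prevalence(strata)
print(f"naive={naive_rate:.3f} weighted={weighted_rate:.5f}")
```

With these assumed counts, the raw bot rate in the review sample is about 13%, while the reweighted estimate is about 0.5%, in line with the extreme imbalance the prompt describes. The same weights can be passed as sample weights when training or evaluating, so that metrics reflect the production population rather than the review queue.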