Build a model to infer home vs office vs public network context
Company: Meta
Role: Data Scientist
Category: Machine Learning
Difficulty: Medium
Interview Round: Technical Screen
You must infer whether a Facebook session’s network context is home, office, or public venue to inform Portal targeting. Constraints: IPs may be shared (NAT), dynamic, or CGNAT; households have multiple users; only privacy‑preserving telemetry is allowed (timestamps, coarse geolocation, ASN/ISP, device/app vs web, session lengths, concurrent sessions, contact‑graph features). Today is 2025-09-01. Build an ML approach:
1) Features: propose robust, leak‑free features capturing diurnal/weekly patterns, ISP/ASN type (residential vs enterprise vs mobile), IP stability, geolocation drift, concurrent user counts on the same IP, session inter‑arrival, device/browser/OS mix, reverse DNS hints, and calling‑graph closeness (e.g., kin vs coworker patterns). Explain how to handle apartments sharing a router and coffee‑shop Wi‑Fi.
2) Labels: design weak‑supervision strategies to obtain labels at scale (e.g., overnight dwell heuristics, business‑hours rules, known corporate ASNs, opted‑in seed users, store‑IP blacklists). Describe how you will de‑bias noisy labels.
3) Modeling: compare baseline rule lists vs gradient‑boosted trees vs sequence models (e.g., per‑IP HMM or transformer over events). Consider multi‑instance learning to aggregate session‑level predictions to user/household. Explain calibration and thresholding for asymmetric costs (e.g., when misclassifying office as home is costlier than the reverse).
4) Evaluation: define metrics (macro F1, expected cost), temporal cross‑validation with held‑out geographies, and backtests across holidays. Prevent leakage from future behavior and from using Portal adoption as a proxy. Quantify uncertainty.
5) Privacy/compliance: specify minimization, aggregation, retention, on‑device inference options, and red‑teaming for re‑identification risks.
6) Deployment: outline real‑time vs batch inference, drift monitoring, and a holdout plan to measure whether location‑type targeting improves conversion.
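For the thresholding part of item 3, one possible approach is to replace "pick the most probable class" with "pick the class with the lowest expected cost". The sketch below assumes the model already outputs calibrated class probabilities. The cost values and the names `COST`, `expected_costs`, and `decide` are illustrative assumptions, not part of the question.

```python
# Minimal sketch of cost-sensitive decisions over calibrated probabilities.
# All cost values below are made-up placeholders for illustration.
CLASSES = ["home", "office", "public"]

# COST[true_class][predicted_class]; office-predicted-as-home is assumed
# to be the most expensive mistake.
COST = {
    "home":   {"home": 0.0, "office": 1.0, "public": 1.0},
    "office": {"home": 5.0, "office": 0.0, "public": 1.0},
    "public": {"home": 2.0, "office": 1.0, "public": 0.0},
}


def expected_costs(probs):
    """Expected cost of each candidate prediction, given class probabilities."""
    return {
        pred: sum(probs[true] * COST[true][pred] for true in CLASSES)
        for pred in CLASSES
    }


def decide(probs):
    """Return the prediction that minimizes expected cost."""
    costs = expected_costs(probs)
    return min(CLASSES, key=lambda c: costs[c])


# "home" is the most probable class here, but the office-as-home penalty
# makes "office" the cheaper call in expectation.
session = {"home": 0.55, "office": 0.35, "public": 0.10}
print(expected_costs(session))
print(decide(session))
```

Under these assumed costs, predicting "home" is only worthwhile when the office probability is small, which is the effect an asymmetric threshold is meant to produce. The same `expected_costs` quantity, averaged over a labeled evaluation set, can serve as the expected-cost metric named in item 4.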
Quick Answer: This question evaluates a data scientist's applied machine learning competencies including privacy-preserving feature engineering, weak‑supervision labeling, model selection and calibration, uncertainty quantification, and operational deployment for inferring session network context (home vs office vs public) from telemetry.
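As a rough illustration of the weak-supervision labeling mentioned above, the heuristics from item 2 can be written as labeling functions that either vote for a class or abstain, then combined by a simple vote. This is a sketch under stated assumptions: the per-IP field names (`overnight_share`, `weekday_9_to_5_share`, `asn_type`, `distinct_users_7d`, `median_session_min`) and all thresholds are hypothetical, and a plain majority vote stands in for the more careful de-biasing the question asks about.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion


def lf_overnight_dwell(ip):
    # Activity concentrated overnight suggests a residence.
    return "home" if ip["overnight_share"] >= 0.6 else ABSTAIN


def lf_business_hours(ip):
    # Weekday daytime activity on an enterprise ASN suggests an office.
    if ip["weekday_9_to_5_share"] >= 0.7 and ip["asn_type"] == "enterprise":
        return "office"
    return ABSTAIN


def lf_many_transient_users(ip):
    # Many distinct users with short sessions suggests a public venue.
    if ip["distinct_users_7d"] >= 50 and ip["median_session_min"] <= 30:
        return "public"
    return ABSTAIN


LABELING_FUNCTIONS = [lf_overnight_dwell, lf_business_hours, lf_many_transient_users]


def weak_label(ip):
    """Majority vote over non-abstaining functions; ties and no votes abstain."""
    votes = [v for v in (lf(ip) for lf in LABELING_FUNCTIONS) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting heuristics: leave unlabeled
    return ranked[0][0]


apartment = {"overnight_share": 0.8, "weekday_9_to_5_share": 0.2,
             "asn_type": "residential", "distinct_users_7d": 12,
             "median_session_min": 40}
print(weak_label(apartment))
```

Abstaining on conflicts keeps ambiguous cases, such as a shared apartment router that looks both residential and crowded, out of the training set rather than forcing a noisy label onto them.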