Build and evaluate an illegal-video classifier
Company: Google
Role: Data Scientist
Category: Machine Learning
Difficulty: hard
Interview Round: Technical Screen
Design an end‑to‑end system to flag illegal YouTube videos.
- Data: videos with titles/descriptions/captions/thumbnails; sparse, noisy labels; strong class imbalance; evolving policies.
- Modeling: choose architectures (vision, audio, text; multimodal fusion), pretraining/embeddings, and a strategy for weak supervision and active learning.
- Evaluation: define offline metrics (AUROC, PR‑AUC, calibration, cost‑weighted utility), thresholding for triage tiers, and how to build a reliable test set that resists leakage, near‑duplicates, and distribution shift.
- Safety/abuse: adversarial evasion, fairness/false‑positive harms, appeals workflow, and human‑in‑the‑loop review throughput constraints.
- Online: rollout plan (shadow mode, canary, interleaving with human rules), counterfactual risk estimation via inverse propensity scoring (IPS) and doubly robust (DR) estimators, and experiment design to measure reduction in policy violations without introducing selection bias.
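The counterfactual estimators named in the Online bullet can be illustrated with a minimal sketch. It assumes logged review decisions come with the logging policy's propensity for the action actually taken, and that a reward model is available for the DR variant. The function names, argument names, and toy numbers are hypothetical and not part of the original question.

```python
from typing import Sequence


def ips_estimate(
    rewards: Sequence[float],
    logging_propensities: Sequence[float],
    target_probs: Sequence[float],
) -> float:
    """Inverse propensity scoring estimate of the target policy's mean reward.

    rewards[i]: observed outcome for the logged action on item i.
    logging_propensities[i]: probability the logging policy took that action.
    target_probs[i]: probability the new policy would take the same action.
    """
    if any(p <= 0.0 for p in logging_propensities):
        raise ValueError("logging propensities must be strictly positive")
    weighted = [
        r * p_new / p_log
        for r, p_log, p_new in zip(rewards, logging_propensities, target_probs)
    ]
    return sum(weighted) / len(weighted)


def dr_estimate(
    rewards: Sequence[float],
    logging_propensities: Sequence[float],
    target_probs: Sequence[float],
    q_logged: Sequence[float],
    q_target: Sequence[float],
) -> float:
    """Doubly robust estimate: reward-model baseline plus an IPS-weighted residual.

    q_logged[i]: reward model's prediction for the logged action on item i.
    q_target[i]: reward model's expected reward under the target policy on item i.
    """
    if any(p <= 0.0 for p in logging_propensities):
        raise ValueError("logging propensities must be strictly positive")
    terms = [
        q_t + (p_new / p_log) * (r - q_l)
        for r, p_log, p_new, q_l, q_t in zip(
            rewards, logging_propensities, target_probs, q_logged, q_target
        )
    ]
    return sum(terms) / len(terms)


# Toy logged data (illustrative only, not real measurements).
rewards = [1.0, 0.0, 1.0, 0.0]
logging_propensities = [0.5, 0.5, 0.25, 0.25]
target_probs = [1.0, 0.0, 0.5, 0.5]

ips_value = ips_estimate(rewards, logging_propensities, target_probs)
dr_value = dr_estimate(
    rewards,
    logging_propensities,
    target_probs,
    q_logged=[0.0, 0.0, 0.0, 0.0],  # an uninformative reward model
    q_target=[0.0, 0.0, 0.0, 0.0],
)
```

With an all-zero reward model the DR estimate reduces to plain IPS. In practice, small logging propensities inflate variance, so importance weights are often clipped, at the cost of some bias.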
Quick Answer: This question evaluates competency in end-to-end machine learning system design, including multimodal modeling (vision, audio, text), data engineering for sparse, noisy, and imbalanced labels, robustness and abuse resistance, human-in-the-loop workflows, privacy/retention concerns, and operational metrics.
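The offline metrics and triage thresholds mentioned in the Evaluation bullet could be prototyped along these lines. This is a rough sketch in plain Python. The helper names, the tier labels, and the false-positive and false-negative costs are illustrative assumptions, and the average-precision helper ignores tied scores for simplicity.

```python
from typing import Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Average precision (a common PR-AUC summary); tied scores are not handled specially."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / hits if hits else 0.0


def total_cost(
    scores: Sequence[float],
    labels: Sequence[int],
    threshold: float,
    cost_fp: float,
    cost_fn: float,
) -> float:
    """Cost of flagging every item with score >= threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return cost_fp * fp + cost_fn * fn


def cheapest_threshold(
    scores: Sequence[float],
    labels: Sequence[int],
    cost_fp: float,
    cost_fn: float,
) -> float:
    """Scan observed scores (plus 'flag nothing') and return the lowest-cost threshold."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(
        candidates,
        key=lambda t: total_cost(scores, labels, t, cost_fp, cost_fn),
    )


def assign_tier(score: float, auto_threshold: float, review_threshold: float) -> str:
    """Map a score to a triage tier (tier names are hypothetical)."""
    if score >= auto_threshold:
        return "auto_action"
    if score >= review_threshold:
        return "human_review"
    return "no_action"


# Toy validation set (illustrative only): 1 = violating, 0 = benign.
scores = [0.95, 0.80, 0.60, 0.40, 0.10]
labels = [1, 1, 0, 1, 0]

ap = average_precision(scores, labels)
# Assumed costs: a missed violation is five times as costly as a false flag.
review_threshold = cheapest_threshold(scores, labels, cost_fp=1.0, cost_fn=5.0)
tiers = [assign_tier(s, auto_threshold=0.9, review_threshold=review_threshold) for s in scores]
```

In a real system the review threshold would also be bounded by reviewer throughput, and the scores should be calibrated before fixed cutoffs are trusted across policy changes.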