Discuss resume projects under pressure
Company: TikTok
Role: Machine Learning Engineer
Category: Behavioral & Leadership
Difficulty: hard
Interview Round: Onsite
Walk me through four projects on your resume. For each: define the problem and constraints, describe your specific role and ownership, explain the architecture and key technical decisions, compare against at least one alternative you considered, articulate trade-offs, and provide measurable outcomes (latency, cost, reliability, revenue, etc.). If challenged that “this approach performs poorly” or “we wouldn’t do this,” defend your choices with data and discuss what you would change in hindsight.
Quick Answer: This question evaluates technical ownership, systems thinking, experimental measurement, and leadership communication through structured walkthroughs of multiple machine learning projects under constraints. Interviewers commonly ask it to verify that a candidate can articulate technical decisions, trade-offs, and measurable impact under pressure. It tests knowledge of machine learning systems, data engineering, and software reliability at both the conceptual and practical level, emphasizing justification of design choices, alternatives considered, and quantifiable outcomes rather than implementation details.
Solution
Below is a practical framework to structure your answer, followed by four exemplar projects tailored to a Machine Learning Engineer role. Use the framework for your own experiences; the examples illustrate depth, metrics, and trade-offs.
## Answer Framework (use for each project)
- Problem & Constraints: 1–2 sentences on the user/business problem and hard constraints (latency/QPS, cost, privacy, reliability, launch date).
- Role & Ownership: Your scope, decisions, and leadership (design, model/infra choices, rollout, cross-team work).
- Architecture & Key Decisions: Data flow (ingest → feature → model → serving), major components, and why.
- Alternatives & Trade-offs: One or two alternatives; compare via pros/cons and data.
- Results: Quantify outcomes (e.g., +3.2% watch-time; p95 –18 ms; cost –12%). Include confidence/variance if tested.
- Defense & Hindsight: Defend choices with evidence; call out what you’d do differently next time.
---
## Project 1: Home Feed Ranking v2 (Two-Stage Retrieval + DLRM Ranker)
1) Problem & Constraints
- Problem: Watch-time and CTR plateaued; cold-start complaints increasing.
- Constraints: p95 latency ≤ 80 ms for ranking; >120k QPS; model memory ≤ 4 GB per host; cost-neutral; maintain reliability (error rate < 0.1%).
2) Role & Ownership
- Tech lead for 4 engineers. Owned model design, feature pipeline, serving optimization, and canary→ramp rollout. Partnered with data platform for feature store integration.
3) Architecture & Key Decisions
- Two-stage system:
  - Candidate retrieval: Two-tower model (user/item embeddings), ANN index (ScaNN), sampled softmax training. ~1K candidates/user.
  - Ranking: DLRM (dense user/session stats + sparse IDs via embeddings). Feature hashing + frequency capping to limit vocab; INT8 quantization with QAT; TensorRT serving with dynamic batching.
- Re-ranking: Diversity via MMR to reduce near-duplicate items.
- Feature Store: Single definition used offline/online to reduce skew; point-in-time joins.
- Serving: GPU inference for the ranker (TensorRT); ANN retrieval on CPU with AVX2 (ScaNN); latency budget split (ANN ~20 ms, ranker ~50 ms, re-rank ~5 ms).
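The MMR re-ranking step can be illustrated with a minimal sketch. It assumes each candidate carries a relevance score from the ranker and an embedding vector; the function names, the `lam` trade-off weight, and the use of cosine similarity are illustrative choices, not details from the project.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mmr_rerank(scores, embeddings, k, lam=0.7):
    """Greedy Maximal Marginal Relevance.

    At each step, pick the remaining item that maximizes
    lam * relevance - (1 - lam) * (max similarity to already-selected items).
    Returns the indices of the selected items in selection order.
    """
    remaining = list(range(len(scores)))
    selected = []
    while remaining and len(selected) < k:
        def mmr_value(i):
            redundancy = max(
                (cosine(embeddings[i], embeddings[j]) for j in selected),
                default=0.0,
            )
            return lam * scores[i] - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr_value)
        selected.append(best)
        remaining.remove(best)
    return selected

# Items 0 and 1 are near-duplicates; item 2 is less relevant but different.
scores = [0.90, 0.89, 0.50]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(mmr_rerank(scores, embeddings, k=2, lam=0.5))
```

With `lam=1.0` the procedure reduces to plain top-k by relevance; lower values trade relevance for diversity. The greedy loop is quadratic in the number of candidates, so in practice it would likely run only on a short head of the ranked list to stay inside a ~5 ms budget.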
4) Alternatives & Trade-offs
- Alternative A: XGBoost + engineered crosses. Pros: interpretability, fast training; Cons: plateaued NDCG, harder to scale sparse IDs.
- Alternative B: Wide & Deep. Pros: lightweight; Cons: underfit long-tail interactions vs DLRM.
- Data:
- Offline: DLRM +0.8 pp AUC vs XGBoost; +1.6% NDCG@50.
- Online: +3.2% watch-time/user, +2.1% session length vs control; no statistically significant churn impact.
- Trade-offs: Quantization saved 12–15% compute at small (~0.1 pp) AUC loss; accepted due to latency/cost.
5) Results (Measurable)
- Latency: p95 from 92 ms → 74 ms (–18 ms); p99 from 140 ms → 120 ms.
- Cost: –12% per-1k requests via INT8 + batching.
- Reliability: error rate 0.08% (SLA < 0.1%).
- Business: +3.2% watch-time; +1.8% creator exposure diversity (entropy metric).
6) Defense & Hindsight
- Challenge: “Use a transformer ranker; DLRM is simplistic.”
- Defense: We prototyped a 2-layer transformer ranker: +0.2% watch-time but +25 ms p95 and +22% compute cost; net not justified under 80 ms SLA.
- Hindsight: Would invest earlier in score calibration (isotonic regression) to improve downstream re-ranking and fairness, and in smarter negative sampling for long-tail items.
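Small numeric check: By Little's law, in-flight requests ≈ arrival rate × mean latency. Using the 74 ms p95 as a conservative stand-in for the mean, 120k QPS × 0.074 s ≈ 8.9k concurrent requests per DC; capacity planning was sized with 60% headroom.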
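The concurrency estimate above is easy to reproduce in a few lines. This is a sketch only: the helper names are hypothetical, and reading "60% headroom" as provisioning 1.6× the estimated concurrency is an assumption about how headroom was defined.

```python
def in_flight_requests(qps: float, latency_s: float) -> float:
    """Little's law: average concurrency = arrival rate x time in system."""
    return qps * latency_s

def capacity_with_headroom(concurrency: float, headroom: float) -> float:
    """Provision `headroom` (0.6 = 60%) above the estimated concurrency.

    Assumption: headroom is expressed relative to the estimated load.
    """
    return concurrency * (1.0 + headroom)

concurrency = in_flight_requests(qps=120_000, latency_s=0.074)
print(round(concurrency))
print(round(capacity_with_headroom(concurrency, headroom=0.6)))
```

Plugging in a tail latency instead of the mean overstates average concurrency, which errs on the safe side for capacity planning.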
---
## Project 2: Real-Time Toxicity Classifier for Comments/Live Chat
1) Problem & Constraints
- Problem: High rate of abusive content reports; moderation needed before display when possible.
- Constraints: End-to-end decisioning < 50 ms; classifier p95 < 20 ms; false-positive rate (FPR) < 0.5% to avoid over-censorship; multilingual (top 10 languages); CPU-only in edge PoPs.
2) Role & Ownership
- Model lead. Built teacher–student pipeline, active learning loop, lexicon/rule fallback, and streaming inference (ONNX Runtime + quantization). Coordinated with Trust & Safety and localization.
3) Architecture & Key Decisions
- Teacher: Multilingual BERT fine-tuned with human-labeled data (abuse/harassment/hate), focal loss to handle class imbalance.
- Student: Distilled TinyBERT (6 layers → 4 layers), dynamic max length by language; INT8 quantization.
- Fallback rules: Curated lexicons/regex for high-precision slurs (guardrail for recall).
- Active learning: Uncertainty sampling + diversity; weekly labeling sprints.
- Calibration: Temperature scaling per language; thresholds tuned to meet FPR caps.
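The calibration and threshold-tuning step can be sketched as follows. It assumes a held-out set of benign (negative) comments per language with model scores available; the function names and the "flag when score > threshold" rule are illustrative, not the production logic.

```python
import math

def calibrated_probability(logit, temperature):
    """Temperature scaling for a binary classifier: sigmoid(logit / T)."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))

def threshold_for_fpr(benign_scores, max_fpr):
    """Lowest threshold such that at most max_fpr of benign scores exceed it.

    A comment is flagged when score > threshold, so allowing `allowed`
    false positives means using the (allowed + 1)-th largest benign score.
    """
    ordered = sorted(benign_scores, reverse=True)
    allowed = math.floor(max_fpr * len(ordered))
    if allowed >= len(ordered):
        return float("-inf")
    return ordered[allowed]

def false_positive_rate(benign_scores, threshold):
    """Share of benign comments that would be flagged at this threshold."""
    flagged = sum(1 for s in benign_scores if s > threshold)
    return flagged / len(benign_scores)

def per_language_thresholds(benign_scores_by_lang, max_fpr):
    """One threshold per language, each tuned to the same FPR cap."""
    return {
        lang: threshold_for_fpr(scores, max_fpr)
        for lang, scores in benign_scores_by_lang.items()
    }

# Synthetic benign scores for two hypothetical language buckets.
benign = {
    "en": [i / 1000 for i in range(1000)],
    "es": [i / 2000 for i in range(1000)],
}
thresholds = per_language_thresholds(benign, max_fpr=0.005)
for lang, t in thresholds.items():
    print(lang, t, false_positive_rate(benign[lang], t))
```

Tuning the threshold on benign traffic alone fixes the FPR directly; recall at that threshold then has to be checked separately on labeled abusive examples.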
4) Alternatives & Trade-offs
- Alternative A: TF-IDF + Logistic Regression. Pros: ultra-fast (<2 ms); Cons: poor recall on paraphrases/codewords.
- Alternative B: Full mBERT in prod. Pros: best F1; Cons: ~35–40 ms p95 CPU; violates SLA at peak.
- Trade-offs: Student model sacrificed ~1.5 pp F1 vs teacher but achieved 18 ms p95; lexicon fallback recovered ~0.4 pp recall with +0.2 ms latency.
5) Results (Measurable)
- Safety: Abusive comment incidence (reports per 10k views) –28% (p < 0.01).
- Latency: 18 ms p95, 30 ms p99.
- Cost: ~$0.42 per million inferences (CPU, quantized, batch=4 micro-batching).
- Appeals: No significant increase; successful appeal rate steady at ~0.3%.
6) Defense & Hindsight
- Challenge: “Rules are brittle; don’t combine with ML.”
- Defense: A/B showed lexicon+ML reduced false negatives by 6% on explicit slurs with negligible latency; rules only trigger on high-precision patterns, never alone for nuanced cases.
- Hindsight: Earlier investment in adversarial training (obfuscations, homoglyphs) and per-community adaptive thresholds would further reduce misses.
Small numeric note: FPR is measured over benign comments, so with a 0.5% FPR target and 200M comments/day (nearly all benign), the upper bound on wrongful flags is ≈ 1M/day; with our tuned thresholds we observed ~0.32% FPR → ~640k/day, mitigated by soft actions (downrank/quarantine) instead of hard blocks.
---
## Project 3: Unified Feature Store (Offline–Online Consistency)
1) Problem & Constraints
- Problem: Training–serving skew incidents causing metric regressions; duplicated feature logic; slow iteration.
- Constraints: Point-in-time correctness; online fetch p99 < 10 ms; backfills over 200 TB; GDPR/retention compliance; high availability (three 9s).
2) Role & Ownership
- Co-designed system with data platform. Owned feature spec DSL, point-in-time join library, online store schema, and SLAs/monitoring. Drove adoption across 5 ML teams.
3) Architecture & Key Decisions
- Registry & DSL: Single feature definition (SQL + UDFs) compiled to offline (Spark) and online (Flink) jobs.
- Offline Store: Parquet in data lake; time-partitioned; backfill tooling with watermark enforcement.
- Online Store: Redis cluster (replicated) with TTLs and freshness indicators; write-ahead log for recovery.
- Point-in-time Joins: Prevent leakage via as-of joins; training data generated with event-time only.
- Monitoring: Staleness SLOs, schema drift alerts, feature null-rate/KS drift dashboards.
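The drift checks in the monitoring bullet can be sketched with a two-sample Kolmogorov–Smirnov statistic and a null-rate check. The alert thresholds and function names here are hypothetical placeholders, not values from the project.

```python
import bisect

def ks_statistic(reference, current):
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    d = 0.0
    # The supremum of |F_ref - F_cur| is attained at one of the sample points.
    for x in ref + cur:
        f_ref = bisect.bisect_right(ref, x) / len(ref)
        f_cur = bisect.bisect_right(cur, x) / len(cur)
        d = max(d, abs(f_ref - f_cur))
    return d

def null_rate(values):
    """Fraction of missing (None) feature values."""
    return sum(1 for v in values if v is None) / len(values)

def drift_alerts(reference, current, ks_limit=0.2, null_limit=0.05):
    """Return the list of triggered alerts; both limits are illustrative."""
    alerts = []
    if null_rate(current) > null_limit:
        alerts.append("null_rate")
    ref_vals = [v for v in reference if v is not None]
    cur_vals = [v for v in current if v is not None]
    if ref_vals and cur_vals and ks_statistic(ref_vals, cur_vals) > ks_limit:
        alerts.append("ks_drift")
    return alerts

print(drift_alerts([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))
print(drift_alerts([1, 2, 3, 4, 5], [10, 11, None, 12, 13]))
```

This version is O(n log n) per feature, which is fine for sampled daily snapshots; a production check would more likely compare histograms or sketches than raw values.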
4) Alternatives & Trade-offs
- Alternative A: Per-team bespoke pipelines. Pros: speed initially; Cons: skew bugs, duplicated effort.
- Alternative B: Managed vendor. Pros: faster boot; Cons: data residency and custom UDF constraints; higher per-GB cost.
- Trade-offs: Slightly higher storage cost (+8%) for dual stores vs large reduction in outages and cycle time.
5) Results (Measurable)
- Reliability: Skew incidents –80% (from 10/quarter → 2/quarter); MTTR from days → hours.
- Speed: Model iteration cycle 14 days → 6 days (–57%).
- Performance: Online fetch p99 6 ms; p50 1.7 ms.
- Cost: +8% storage; –15% engineer time on data plumbing (survey/time-tracking).
6) Defense & Hindsight
- Challenge: “Over-engineered for our size.”
- Defense: Incident cost analysis showed one skew incident cost ~1 week of multi-team effort; breakeven at ~3 prevented incidents/year. We prevented >6/year.
- Hindsight: Start with narrower feature domains to accelerate adoption; add row-level ACLs and lineage UI earlier for compliance and auditability.
Small example (point-in-time join guardrail): for an impression at t=10:00, a click at t=10:05 must not appear in that impression's training features. Our library enforces feature_time ≤ impression_time – ε, where impression_time is the moment the prediction would have been served.
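The guardrail can be sketched as an as-of lookup over a feature's history. This is a minimal illustration, not the library itself: timestamps are minutes since midnight, the cutoff is taken to be the impression time minus ε (an assumption about how the cutoff is defined), and all names are hypothetical.

```python
import bisect

def as_of_value(feature_history, cutoff):
    """Latest feature value with timestamp <= cutoff, or None if none exists.

    feature_history: list of (timestamp, value) tuples sorted by timestamp.
    """
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect.bisect_right(timestamps, cutoff)
    return feature_history[idx - 1][1] if idx else None

def training_feature(impression_time, feature_history, epsilon=1):
    """Feature value visible at serving time; later updates are excluded."""
    return as_of_value(feature_history, impression_time - epsilon)

# Running click count for one user: 3 clicks as of 09:50, 4 after the 10:05 click.
click_count_history = [(9 * 60 + 50, 3), (10 * 60 + 5, 4)]
impression_time = 10 * 60  # 10:00

print(training_feature(impression_time, click_count_history))
```

The lookup returns the 09:50 value, so the 10:05 click never leaks into the training row for the 10:00 impression. A dataframe-based pipeline would express the same idea with a backward as-of join keyed on entity ID and event time.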
---
## Project 4: Notification Ranking via Uplift Modeling (Reduce Spam, Preserve Retention)
1) Problem & Constraints
- Problem: High notification volume drove opt-outs and user fatigue; need to send fewer, more impactful notifications.
- Constraints: Per-user daily cap; decision latency < 30 ms; safety guardrails (no sensitive categories at night); business goal: maintain or improve D1/D7 retention.
2) Role & Ownership
- Led modeling and policy layer. Owned treatment policy, offline evaluation, and online experimentation with guardrails. Collaborated with messaging infra and policy teams.
3) Architecture & Key Decisions
- Modeling: Two-model uplift (T-learner): train E[Y|T=1,x] and E[Y|T=0,x]; uplift u(x)=μ1(x)–μ0(x). Targets: next-day return and downstream engagement.
- Debiasing: Inverse propensity weighting using historical randomized buckets; CUPED baseline features to reduce variance.
- Decisioning: Select top-K notifications per user by uplift subject to caps and safety constraints (simple knapsack/greedy).
- Evaluation: Offline AUUC/Qini; online A/B with sequential testing and guardrails (opt-out rate, complaint rate).
- Serving: Precompute candidate scores hourly; lightweight online re-scoring with context; p95 < 25 ms.
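The modeling and decisioning steps can be sketched end to end. To stay dependency-free, this T-learner uses per-segment outcome means as its two base models, a deliberate simplification standing in for real regressors; segment names, notification IDs, and function names are all hypothetical.

```python
from collections import defaultdict

def fit_segment_means(rows):
    """Trivial base learner: mean outcome per segment.

    rows: iterable of (segment, outcome) pairs.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for segment, outcome in rows:
        totals[segment][0] += outcome
        totals[segment][1] += 1
    return {seg: total / count for seg, (total, count) in totals.items()}

def t_learner_uplift(treated_rows, control_rows):
    """u(x) = mu1(x) - mu0(x), with one model fit per arm."""
    mu1 = fit_segment_means(treated_rows)
    mu0 = fit_segment_means(control_rows)
    return {seg: mu1[seg] - mu0[seg] for seg in mu1.keys() & mu0.keys()}

def select_notifications(candidates, uplift, daily_cap, blocked=frozenset()):
    """Greedy policy: send the highest positive-uplift candidates up to the cap.

    candidates: list of (notification_id, segment) pairs.
    blocked: segments disallowed by safety rules (e.g. sensitive at night).
    """
    scored = [
        (uplift.get(seg, 0.0), nid)
        for nid, seg in candidates
        if seg not in blocked
    ]
    positive = sorted((item for item in scored if item[0] > 0), reverse=True)
    return [nid for _, nid in positive[:daily_cap]]

# Outcome = returned next day (1) or not (0), from randomized buckets.
treated = [("social", 1), ("social", 1), ("promo", 0), ("promo", 1)]
control = [("social", 0), ("social", 1), ("promo", 1), ("promo", 1)]
uplift = t_learner_uplift(treated, control)

candidates = [("n1", "promo"), ("n2", "social")]
print(select_notifications(candidates, uplift, daily_cap=2))
```

Candidates with zero or negative estimated uplift are dropped even when the cap has room, which is what separates this policy from a CTR ranker that always fills the cap.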
4) Alternatives & Trade-offs
- Alternative A: CTR model. Pros: stable estimates; Cons: maximizes clicks, not incremental value → caused send inflation.
- Alternative B: Heuristic throttling (per-user cooldown). Pros: simple; Cons: leaves value on table and brittle across cohorts.
- Trade-offs: Uplift models are higher variance; mitigated with shrinkage (Bayesian ridge on uplift) and minimum sample thresholds per template.
5) Results (Measurable)
- Volume: –24% notifications sent/user/day.
- User health: Opt-outs –17%; complaint rate –12%.
- Retention: +0.6 pp D7 retention (statistically significant); revenue neutral.
- Latency: 22 ms p95 decisioning; SLA met.
6) Defense & Hindsight
- Challenge: “Uplift is too noisy to trust.”
- Defense: We used randomized holdouts for calibration; AUUC improved by +11%; online showed stable gains across cohorts. Guardrails prevented regressions.
- Hindsight: A simpler CTR + fatigue penalty could have shipped faster (80% of benefit); next iteration would use contextual bandits for continuous exploration with per-user uncertainty.
Small numeric example: If propensity p(T=1|x)=0.2 and observed click y=1 under treatment, IPW contribution is y/p=1/0.2=5; control y=0 contributes 0/(1–0.2)=0. CUPED reduced variance by ~25%, shortening test duration by ~30%.
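The IPW arithmetic and the CUPED adjustment can be checked with a short sketch. It assumes known propensities from randomized buckets and a single pre-experiment covariate for CUPED; the sample data and function names are made up for illustration.

```python
from statistics import mean, pvariance

def ipw_term(y, treated, propensity):
    """Signed IPW contribution of one observation to the ATE estimate."""
    if treated:
        return y / propensity
    return -y / (1.0 - propensity)

def ipw_ate(samples):
    """Average treatment effect estimate from (y, treated, propensity) tuples."""
    return mean(ipw_term(y, t, p) for y, t, p in samples)

def cuped_adjust(y, x):
    """CUPED: y_adj = y - theta * (x - mean(x)), theta = cov(x, y) / var(x).

    x is a pre-experiment covariate (e.g. the same metric measured before launch).
    """
    x_bar, y_bar = mean(x), mean(y)
    cov = mean((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    theta = cov / pvariance(x)
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]

# The worked example: treated with y=1 and p=0.2; control with y=0.
print(ipw_term(1, True, 0.2), ipw_term(0, False, 0.2))

# Toy data where the pre-period metric x predicts the outcome y well.
x = [1, 2, 3, 4]
y = [2, 4, 6, 9]
y_adj = cuped_adjust(y, x)
print(pvariance(y), pvariance(y_adj))
```

The adjustment leaves the mean of `y` unchanged and shrinks its variance in proportion to the squared correlation between `x` and `y`, which is why a strongly predictive pre-period metric matters.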
---
## Final Tips
- Keep each project to 2–3 minutes; emphasize metrics and constraints.
- Anticipate challenges (latency, cost, bias, safety) and have data ready.
- When you don’t have exact numbers, use ranges and explain measurement methods (A/B, CUPED, sequential tests, confidence intervals).
- Always close with what you’d change in hindsight to show learning and leadership.