Design query generation to maximize CTR
Company: TikTok
Role: Machine Learning Engineer
Category: ML System Design
Difficulty: hard
Interview Round: Technical Screen
##### Question
Design an end-to-end query-generation system that maximizes click-through rate (CTR) for a large-scale search/recommendation product. Walk through your design and address the following:
1. **Goals, constraints, and metrics.** Define the objective (and how you measure CTR), latency/throughput SLAs, availability, multilingual/multi-locale scope, and guardrail metrics (search success, abandonment, safety, privacy).
2. **Overall architecture.** Sketch the request path and the offline/streaming data platform end to end (diagrams or pseudocode for the critical components are welcome).
3. **Data ingestion and labeling.** Event schema for impressions/clicks, attribution windows, positive/negative labeling, position-bias and propensity logging, and data-quality/ETL handling.
4. **Feature engineering.** Text, embedding, popularity/recency, personalization, context, and presentation features, with online/offline feature parity.
5. **Candidate generation (recall).** Multiple complementary sources (lexical/prefix, popular/trending, embedding/ANN, co-click graph, generative).
6. **Ranking models.** Multi-stage ranking, loss/objective choices (pointwise/pairwise/listwise), calibration, and re-ranking with diversity/safety constraints.
7. **Model training and serving.** Training pipeline, retraining cadence, and the real-time serving stack (feature store, encoders, ANN, ranker, orchestrator) under the latency budget.
8. **Exploration–exploitation, feedback loops, and debiasing.** Bandit strategy (e.g., Thompson Sampling / LinUCB), propensity logging, and correction for position/selection/popularity bias.
9. **Evaluation.** Offline metrics and counterfactual evaluation (IPS/SNIPS/DR) plus online A/B testing methodology (splits, power, guardrails, ramp).
10. **Personalization, cold start, and diversity.** Personalization signals, new-user/new-query/new-market cold start, and diversity/coverage controls.
11. **Safety/abuse and privacy compliance.** Blocklists, classifiers, PII handling, jurisdictional policy, and privacy/compliance.
12. **Scalability, reliability, and post-launch iteration.** How you shard and degrade gracefully, and how you would iterate after launch.
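Item 8 names Thompson Sampling as one bandit option. The following is a minimal sketch, assuming binary click feedback, a small fixed candidate set, and a Beta(1, 1) prior on each candidate's CTR. The class name `BetaBernoulliTS` and the candidate strings are hypothetical and used only for illustration; a production system would likely be contextual (e.g., LinUCB) and would log the selection propensity alongside each impression.

```python
import random


class BetaBernoulliTS:
    """Thompson Sampling over a fixed set of candidate queries (illustrative)."""

    def __init__(self, candidates, seed=None):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior on each candidate's CTR, stored as [alpha, beta].
        self.params = {c: [1.0, 1.0] for c in candidates}

    def select(self):
        # Draw one CTR sample per candidate from its posterior, show the argmax.
        samples = {
            c: self.rng.betavariate(a, b) for c, (a, b) in self.params.items()
        }
        return max(samples, key=samples.get)

    def update(self, candidate, clicked):
        # A click increments alpha; a non-click increments beta.
        if clicked:
            self.params[candidate][0] += 1.0
        else:
            self.params[candidate][1] += 1.0

    def posterior_mean(self, candidate):
        a, b = self.params[candidate]
        return a / (a + b)


# Toy usage: two hypothetical suggestions with made-up feedback.
bandit = BetaBernoulliTS(["cat videos", "cat memes"], seed=0)
shown = bandit.select()
bandit.update(shown, clicked=True)
```

Candidates with few impressions keep wide posteriors, so they are still sampled occasionally; that is how exploration arises without a separate epsilon parameter.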
Quick Answer: This question evaluates how you handle ML product requirements, data and labeling, modeling, serving architecture, evaluation, monitoring, and trade-offs in a realistic interview setting. A strong answer states its assumptions, handles edge cases, explains trade-offs, and shows clearly how to validate the result.
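One way to show how to validate a new policy before an A/B test is the counterfactual evaluation named in item 9. The sketch below computes the IPS and SNIPS estimates of a target policy's CTR from logged data, assuming each logged impression records the logging policy's propensity. The function name `ips_snips`, the tuple layout of the logs, and the toy data are assumptions made for illustration.

```python
def ips_snips(logs, target_prob):
    """Estimate a target policy's CTR from logged bandit feedback.

    logs: iterable of (context, action, reward, logging_propensity)
    target_prob: function (context, action) -> probability under the target policy
    Returns (ips_estimate, snips_estimate).
    """
    weights = []
    weighted_rewards = []
    for context, action, reward, propensity in logs:
        if propensity <= 0:
            raise ValueError("logging propensity must be positive")
        # Importance weight: how much more (or less) likely the target
        # policy is to show this action than the logging policy was.
        w = target_prob(context, action) / propensity
        weights.append(w)
        weighted_rewards.append(w * reward)

    n = len(weights)
    if n == 0:
        raise ValueError("no logged impressions")
    total = sum(weighted_rewards)
    weight_sum = sum(weights)
    ips = total / n  # unbiased, but can have high variance
    snips = total / weight_sum if weight_sum > 0 else 0.0  # self-normalized
    return ips, snips


# Toy logs: (user, suggested query, clicked, probability the logger showed it).
toy_logs = [
    ("u1", "q_a", 1, 0.5),
    ("u1", "q_b", 0, 0.5),
    ("u2", "q_a", 0, 0.8),
    ("u2", "q_b", 1, 0.2),
]
# Hypothetical target policy that always suggests "q_a".
always_q_a = lambda context, action: 1.0 if action == "q_a" else 0.0
ips, snips = ips_snips(toy_logs, always_q_a)
```

On this toy data IPS is 0.5 and SNIPS is about 0.615. SNIPS trades a small bias for lower variance by normalizing with the sum of the weights instead of the sample count; a doubly robust (DR) estimator would additionally require a reward model, which is omitted here.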