Design an ML pipeline that generates search keyword recommendations for an app marketplace. Given a query like "games," produce diverse, typed suggestions (e.g., genres such as puzzle, RPG, racing) with high relevance and coverage. Specify objectives and constraints (relevance, diversity, freshness, latency, privacy). Detail data sources (query/search logs, clicks, installs, uninstalls, app metadata and taxonomy, reviews, co-search/co-click graphs, embeddings, locale signals) and labeling/feedback strategies. Propose the system architecture: candidate generation and ranking stages, feature store, offline training, online serving, cache, and retrieval. Describe features (text/semantic embeddings, popularity/recency, user/context signals, co-occurrence, graph features, quality/spam signals). Compare model options (BM25/ANN retrieval, two-tower retrieval, gradient-boosted trees, pairwise/listwise rankers, sequence models, graph models) and justify choices. Define evaluation metrics and experimentation (CTR, install rate, coverage, diversity, precision/recall, latency/errors; A/B testing and guardrails). Explain online/continual training after launch (streaming feedback ingestion, feature freshness, update cadence, warm-starting, drift detection, rollback). Discuss handling cold start, multilingual/locale variants, spam/abuse, and fairness.

This question evaluates ML product requirements, data and labeling, modeling, serving architecture, evaluation, monitoring, and trade-off reasoning in a realistic interview setting. A strong answer states assumptions, handles edge cases, explains trade-offs, and shows clearly how to validate the result.

What difficulty level is this interview question?

This is a hard-difficulty ML System Design question, commonly asked during Technical Screen rounds at Apple.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at Apple during technical interviews.

Design an ML keyword recommendation pipeline

ML System Design: Typed Search Keyword Recommendations for an App Marketplace

Goal

Design an end-to-end ML pipeline that, given a user query (e.g., "games"), generates diverse, typed keyword suggestions (e.g., "puzzle games", "RPG games", "racing games") with high relevance and coverage.

Assume you are designing for a large-scale app marketplace with millions of users and tens of thousands of queries per second during peak. Typed suggestions are grounded in a controlled taxonomy (e.g., Genre, Feature, Price, Age, Mode) and must be compliant with marketplace policies.

Requirements

Objectives and constraints

Relevance, diversity, coverage
Freshness/trends, multilingual/locale correctness
Latency and availability SLOs
Privacy and policy compliance

Data sources and labeling/feedback

Query/search logs, clicks, installs, uninstalls
App metadata and taxonomy
Review text, co-search/co-click graphs
Embeddings, locale signals
Labeling: implicit feedback (CTR/installs), counterfactual debiasing, editorial seeds
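
One way to make the counterfactual debiasing item concrete is inverse propensity weighting for position bias. The sketch below is illustrative only: the `PROPENSITY` table, the `ipw_click_score` name, and the clipping value are assumptions, not part of the prompt. In practice the propensities would have to be estimated, for example from a small randomized-position experiment.

```python
from collections import defaultdict

# Hypothetical examination propensities per display position (assumed values).
PROPENSITY = {1: 1.0, 2: 0.6, 3: 0.4, 4: 0.25}


def ipw_click_score(impressions, propensity=PROPENSITY, clip=0.1):
    """Position-debiased click score per (query, suggestion) pair.

    impressions: iterable of (query, suggestion, position, clicked) tuples.
    Each click is weighted by 1 / propensity[position]. Propensities are
    floored at `clip` to bound the variance of the estimate. The result is
    a relevance label weight, not a probability, so it can exceed 1.
    """
    weighted_clicks = defaultdict(float)
    shows = defaultdict(int)
    for query, suggestion, position, clicked in impressions:
        key = (query, suggestion)
        shows[key] += 1
        if clicked:
            p = max(propensity.get(position, clip), clip)
            weighted_clicks[key] += 1.0 / p
    return {key: weighted_clicks[key] / shows[key] for key in shows}


example_logs = [
    ("games", "puzzle games", 1, True),
    ("games", "puzzle games", 1, False),
    ("games", "racing games", 4, True),
    ("games", "racing games", 4, False),
]
scores = ipw_click_score(example_logs)
```

Both suggestions have the same raw CTR here, but the click at position 4 counts for more because users rarely examine that slot.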

System architecture

Candidate generation: lexical, semantic (ANN), taxonomy, graph/co-occurrence, trending
Ranking: multi-stage (LTR + neural), diversity-aware re-rank
Feature store (offline/online), offline training, online serving, cache, retrieval indices
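
The diversity-aware re-rank step could be sketched with greedy Maximal Marginal Relevance (MMR). This is one possible choice, not something the prompt mandates. The relevance scores, the `lam` trade-off weight, and the "same taxonomy type" similarity are all hypothetical values chosen for illustration.

```python
def mmr_rerank(candidates, relevance, similarity, k, lam=0.7):
    """Greedy MMR: trade relevance against similarity to already-picked items.

    candidates: list of suggestion ids.
    relevance:  dict mapping id -> ranker score.
    similarity: function (a, b) -> float in [0, 1].
    lam:        weight on relevance; (1 - lam) penalizes redundancy.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:

        def mmr_score(c):
            max_sim = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * max_sim

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical typed suggestions for the query "games".
types = {
    "puzzle games": "Genre",
    "rpg games": "Genre",
    "free games": "Price",
    "offline games": "Mode",
}
relevance = {
    "puzzle games": 0.90,
    "rpg games": 0.85,
    "free games": 0.60,
    "offline games": 0.50,
}


def same_type(a, b):
    # Assumed similarity: 1.0 when two suggestions share a taxonomy type.
    return 1.0 if types[a] == types[b] else 0.0


top3 = mmr_rerank(list(relevance), relevance, same_type, k=3)
# top3 == ["puzzle games", "free games", "offline games"]
```

With these numbers the second Genre suggestion is pushed out of the top three even though its raw relevance is higher than the Price and Mode suggestions, which is the intended coverage-across-types behavior.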

Features

Text/semantic embeddings, lexical features
Popularity/recency/trending signals
User/context signals (locale, device)
Co-occurrence/graph features (PMI, P(s|q))
Quality, spam, and trust signals
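
The co-occurrence features named above, PMI and P(s|q), can be computed directly from logged (query, suggestion) events. The following is a minimal sketch; the `cooccurrence_features` name and the assumption that each event is a suggestion clicked after a query are illustrative, and a production version would add smoothing and minimum-count thresholds.

```python
import math
from collections import Counter


def cooccurrence_features(pairs):
    """Compute P(s|q) and PMI(q, s) from (query, suggestion) events.

    P(s|q)    = count(q, s) / count(q)
    PMI(q, s) = log( P(q, s) / (P(q) * P(s)) )
    """
    pairs = list(pairs)
    n = len(pairs)
    pair_counts = Counter(pairs)
    q_counts = Counter(q for q, _ in pairs)
    s_counts = Counter(s for _, s in pairs)

    feats = {}
    for (q, s), c in pair_counts.items():
        p_joint = c / n
        p_q = q_counts[q] / n
        p_s = s_counts[s] / n
        feats[(q, s)] = {
            "p_s_given_q": c / q_counts[q],
            "pmi": math.log(p_joint / (p_q * p_s)),
        }
    return feats


events = [
    ("games", "puzzle games"),
    ("games", "puzzle games"),
    ("games", "racing games"),
    ("music", "radio"),
]
feats = cooccurrence_features(events)
```

P(s|q) favors suggestions that are popular for a query, while PMI corrects for suggestions that are popular everywhere, so the two are usually fed to the ranker together.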

Models and choices

Retrieval: BM25, two-tower ANN, graph-based expansion
Ranking: GBDT, pairwise/listwise LTR, cross-encoder re-ranker, optional sequence/graph models
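
As a baseline for the lexical retrieval option, here is a small self-contained BM25 scorer over candidate suggestion strings. It is a sketch under assumptions: whitespace tokenization, the common defaults `k1=1.2` and `b=0.75`, and the IDF variant with `log(1 + ...)` that keeps scores non-negative. A real system would use a search engine's index rather than scoring every document.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` (non-empty list of strings) against `query`."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


candidates = ["puzzle games", "racing games", "photo editor"]
lexical = bm25_scores("puzzle games", candidates)
```

A lexical scorer like this is cheap and handles exact matches well, which is why it is often kept alongside a two-tower ANN retriever that covers semantic matches with no token overlap.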

Evaluation and experimentation

Metrics: CTR, install rate, NDCG, recall@K, coverage/diversity, latency/errors, calibration
A/B testing with guardrails and statistical rigor
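
Two of the offline metrics listed above, NDCG@K and recall@K, can be sketched as follows. The function names are illustrative. Note one simplifying assumption: the ideal ordering for NDCG is computed from the gains present in the ranked list, so it assumes every relevant item was retrieved.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains, k):
    """NDCG@K; ranked_gains are graded relevance labels in ranked order."""
    ideal = dcg_at_k(sorted(ranked_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top K."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)
```

For example, `ndcg_at_k([3, 2, 1], k=3)` is 1.0 because the list is already in ideal order, and `recall_at_k(["a", "b", "c"], {"a", "c", "d"}, k=2)` is 1/3 because only "a" is in the top two.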

Continual training/ops

Streaming feedback ingestion, feature freshness, update cadence
Warm-starting, drift detection, rollback
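
For drift detection, one commonly used statistic is the Population Stability Index (PSI) between a training-time feature histogram and a recent serving-time histogram. The prompt does not name a specific method, so PSI is an assumption here, as are the function name and the `eps` floor. Alert thresholds such as 0.1 or 0.25 are rules of thumb that would need tuning per feature.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index over two histograms with identical bins.

    expected_counts: bin counts from the reference (training) window.
    actual_counts:   bin counts from the current (serving) window.
    Bin proportions are floored at `eps` to avoid log(0) on empty bins.
    """
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    value = 0.0
    for e, a in zip(expected_counts, actual_counts):
        pe = max(e / e_total, eps)
        pa = max(a / a_total, eps)
        value += (pa - pe) * math.log(pa / pe)
    return value


stable = psi([10, 20, 30], [10, 20, 30])
shifted = psi([50, 50], [90, 10])
```

A monitoring job could compute this per feature on a schedule and trigger retraining or a rollback review when the value stays above the chosen threshold.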

Special cases

Cold start, multilingual/locale variants, spam/abuse, fairness and policy

Constraints & Assumptions

Preserve the scope, facts, inputs, and requested outputs from the prompt above.
If the prompt leaves a detail unspecified, state a reasonable assumption before relying on it.
Keep the answer interview-ready: concise enough to present, but concrete enough to implement or evaluate.

Clarifying Questions to Ask

Clarify users, core use cases, read/write patterns, scale, latency, availability, and data retention.
State explicit assumptions before making sizing or architecture decisions.
Prioritize the functional path first, then address reliability, security, observability, and rollout.

What a Strong Answer Covers

A scoped requirements summary with concrete non-goals and success metrics.
ML-specific data, model, evaluation, serving, and monitoring choices.
Reasoned trade-offs between simple and scalable designs, including bottlenecks and failure modes.
A validation, monitoring, migration, and launch plan appropriate for the risk level.

Follow-up Questions

What breaks first at 10x traffic or data volume?
How would you degrade gracefully during dependency failures?
What metrics and alerts would prove the design is healthy after launch?