System Design: Incorporating Large Language Models (LLMs) into a Large-Scale Recommendation System
Context
You are designing enhancements for a high-throughput, mobile-first recommendation system that serves a mixed-media feed (short videos, images, text, live streams). The system must operate under tight latency and cost budgets, handle multilingual content, and meet strong safety/moderation requirements.
Task
Outline how to incorporate LLMs end-to-end, covering:
- Use cases
  - Item/user metadata enrichment
  - Query and intent understanding (search, natural-language instructions)
  - Cold-start handling (items and users)
  - Generative retrieval
  - Semantic reranking
  - Explanations/justifications
  - Multi-modal recommendations
- Architectures
  - LLM as feature generator (mostly offline)
  - LLM as reranker (online, top-K)
  - LLM as agent/orchestrator (tools + policies)
- Online/offline placement and caching strategies
  - What runs offline vs. online; what to cache and how
- Latency and cost constraints
  - Budgets, fallbacks, distillation/quantization, traffic shaping
- Safety and content filtering
  - Moderation, prompt hardening, PII/fairness guardrails
- Evaluation plans
  - Offline: metrics, ablations, IPS/counterfactual evaluation, quality checks
  - Online: A/B tests, guardrails, feedback loops, monitoring
Provide concrete design choices, resource estimates, and guardrails suitable for a technical screening interview.
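One illustrative sketch of the "LLM as reranker" architecture above: rerank only the top-K retrieved candidates and blend the LLM score with the retrieval score. The `Candidate` type, `blend` weight, and `llm_relevance` function are assumptions for this sketch; the real scorer would be an LLM call, stubbed here with token overlap so the code runs offline.

```python
import heapq
from dataclasses import dataclass

@dataclass
class Candidate:
    item_id: str
    retrieval_score: float
    title: str

def llm_relevance(query: str, title: str) -> float:
    # Stand-in for an LLM relevance call returning a score in [0, 1].
    # Here: a token-overlap heuristic so the sketch runs without a model.
    q, t = set(query.lower().split()), set(title.lower().split())
    return len(q & t) / max(len(q), 1)

def rerank_top_k(query, candidates, k=3, blend=0.5):
    """Rerank only the top-K retrieved candidates; leave the tail untouched.

    Bounding K bounds LLM cost and latency regardless of candidate-set size.
    """
    head = heapq.nlargest(k, candidates, key=lambda c: c.retrieval_score)
    tail = [c for c in candidates if c not in head]
    reranked = sorted(
        head,
        key=lambda c: blend * c.retrieval_score
                      + (1 - blend) * llm_relevance(query, c.title),
        reverse=True,
    )
    return reranked + sorted(tail, key=lambda c: c.retrieval_score, reverse=True)
```

Keeping K small (tens, not thousands) is what makes an online LLM reranker feasible under a per-request latency budget.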
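For the offline enrichment and caching points, one common pattern is a TTL cache keyed by a hash of the model/prompt version plus the content, so that LLM outputs are reused across identical items and invalidated when the model changes. The class name, key scheme, and default TTL below are assumptions for illustration.

```python
import hashlib
import time

class EnrichmentCache:
    """TTL cache for LLM-generated metadata, keyed by (model_version, content).

    Including model_version in the key invalidates entries automatically
    when the prompt or model is upgraded, without a manual flush.
    """

    def __init__(self, ttl_seconds=7 * 24 * 3600):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, inserted_at)

    def _key(self, model_version, content):
        return hashlib.sha256(f"{model_version}:{content}".encode()).hexdigest()

    def get_or_compute(self, model_version, content, compute):
        # Return a fresh cached value, or invoke the (expensive) LLM call.
        key = self._key(model_version, content)
        hit = self._store.get(key)
        now = time.time()
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]
        value = compute(content)
        self._store[key] = (value, now)
        return value
```

In production this would sit in front of a shared store (e.g. a distributed KV cache) rather than an in-process dict, but the keying and invalidation logic is the same.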
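The fallback requirement under latency and cost constraints can be sketched as a hard-deadline wrapper: run the LLM path under a timeout and fall back to the cheap non-LLM ranking on timeout or error. The function names and the 150 ms default budget are assumptions for this sketch.

```python
import concurrent.futures

def with_fallback(primary, fallback, timeout_s=0.15):
    """Run the (possibly slow) LLM path under a hard deadline.

    On timeout or any error from the primary path, serve the cheap
    fallback instead, so the feed never blocks on the LLM.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(primary)
        try:
            return future.result(timeout=timeout_s)
        except Exception:  # includes concurrent.futures.TimeoutError
            future.cancel()
            return fallback()
```

The same pattern extends to traffic shaping: route only a budgeted fraction of requests to `primary` and send the rest straight to `fallback`.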
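For the offline IPS/counterfactual evaluation point, a self-normalized inverse propensity scoring estimator is a minimal concrete example: it estimates the target policy's expected reward from logged interactions without an online test. The log tuple layout below is an assumption for illustration.

```python
def ips_estimate(logs, target_policy_prob):
    """Self-normalized IPS estimate of a target policy's expected reward.

    logs: iterable of (context, action, reward, logging_propensity),
          where logging_propensity is the probability the *logging*
          policy assigned to the logged action.
    target_policy_prob(context, action): probability the *target*
          policy would choose that action in that context.
    """
    numerator = denominator = 0.0
    for context, action, reward, p_log in logs:
        weight = target_policy_prob(context, action) / p_log
        numerator += weight * reward
        denominator += weight
    # Self-normalization reduces variance versus plain IPS.
    return numerator / denominator if denominator else 0.0
```

In practice this is paired with propensity clipping and effective-sample-size checks before trusting the estimate enough to gate an online A/B test.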