Design an LLM-Based Conversational Assistant (Chatbot)
Company: Meta
Role: Machine Learning Engineer
Category: ML System Design
Difficulty: hard
Interview Round: Onsite
## Design an LLM-Based Conversational Assistant (Chatbot)
In a research-design round you are asked to design the model and system behind a general-purpose **conversational assistant** — a multi-turn chatbot that answers questions, holds a coherent dialogue, can stay grounded in fresh or private information, and behaves safely. Cover how you would turn a base language model into a helpful, honest, and harmless assistant, how you keep it grounded and up to date, how you evaluate it, and how you serve it at scale.
### Constraints & Assumptions
- General-purpose consumer + enterprise assistant; broad-domain questions, multi-turn conversations, some tool use (search, calculator, internal knowledge bases).
- Responses are **streamed**; users expect the first token within a few hundred milliseconds and a fluent token rate after that.
- The assistant must avoid clearly harmful outputs (e.g., disallowed content), refuse appropriately, and avoid confidently asserting unverified facts when it could ground the answer in retrieved or tool-provided evidence instead.
- The world changes after training cutoff, and some answers depend on private/enterprise data the model never saw.
- You have a large pretraining budget and access to standard web-scale text plus the ability to collect human feedback data.
- Start from a decoder-only autoregressive transformer base model unless you argue otherwise.
### Clarifying Questions to Ask
- What is the primary objective — engagement/retention, task success, enterprise deployment, or safety-sensitive use cases — and what is the headline metric?
- What is the latency budget (first-token and tokens/sec) and the expected concurrency, since that bounds model size and serving design?
- Which capabilities are in scope for v1: pure chat, retrieval-grounded answers, tool/function calling, long-term memory across sessions?
- What is the safety bar and who owns the policy (allowed/refused content), and do we need region-specific policies?
- Do enterprise customers need their private data grounded without it entering training, and under what isolation guarantees?
- How fresh must answers be: real-time, daily, or is the training cutoff acceptable? The answer decides whether we need retrieval/tools or whether periodic retraining is enough.
### Part 1 — Requirements and base model
State the functional and non-functional requirements and choose the base architecture and size strategy. Decide what the **base (pretrained) model** must provide before any alignment, and how model size trades off against the streaming-latency budget and serving cost.
```hint Separate the two skills
"Knows things / can reason" comes from pretraining; "is a helpful, safe assistant that follows instructions" comes from post-training — keep these as distinct design stages.
```
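The size-versus-latency trade-off can be made concrete with a back-of-envelope estimate. The sketch below assumes single-stream decoding is bound by reading the weights from accelerator memory once per token, which is a common first-order approximation. It ignores KV-cache reads, batching, and multi-device parallelism. The function name and the 2000 GB/s bandwidth figure are illustrative assumptions, not measurements of any specific hardware.

```python
def decode_tokens_per_second(params_billion, bytes_per_param, mem_bandwidth_gb_s):
    """Rough upper bound on single-stream decode speed.

    Assumes each generated token requires reading every weight once,
    so speed is approximately memory bandwidth / model size in bytes.
    """
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return mem_bandwidth_gb_s * 1e9 / weight_bytes


if __name__ == "__main__":
    # Hypothetical accelerator bandwidth; substitute the real figure.
    bandwidth_gb_s = 2000
    for name, params_b in [("7B", 7), ("70B", 70)]:
        for label, bytes_per_param in [("fp16", 2.0), ("int4", 0.5)]:
            tps = decode_tokens_per_second(params_b, bytes_per_param, bandwidth_gb_s)
            print(f"{name} {label}: about {tps:.0f} tokens/s per stream (upper bound)")
```

Under these assumptions a 10x larger model decodes roughly 10x slower per stream. That is why the size decision is tied to the streaming budget and to levers such as quantization.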
### Part 2 — Alignment: turning a base model into an assistant
Describe the post-training pipeline that makes the base model follow instructions, hold a multi-turn conversation, refuse unsafe requests, and match human preferences. Cover supervised fine-tuning and reinforcement learning from human feedback (or a preference-optimization alternative), and how you build the data for each.
```hint The standard three stages
Supervised fine-tuning on demonstration dialogues → train a reward/preference model from human comparisons → optimize the policy against it (RLHF/PPO) — or collapse the last two with a direct preference method (DPO).
```
```hint Safety is data + reward, not a filter bolted on
Refusals and harmlessness should be taught in SFT and rewarded in preference optimization (e.g., explicit harmlessness comparisons / red-team data), not only handled by an output classifier.
```
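To make the preference-learning stages concrete, here is a minimal sketch of the two losses for a single comparison pair: the pairwise (Bradley-Terry style) reward-model loss and the DPO loss. It uses plain Python scalars rather than a training framework. The function names, the argument names, and the default `beta=0.1` are illustrative assumptions; a real pipeline would compute sequence log-probabilities from the policy and a frozen reference model over batches.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen, r_rejected):
    """Pairwise loss: push the chosen response's scalar reward above the rejected one's."""
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair.

    Inputs are sequence log-probabilities under the policy (pi_*) and the
    frozen reference model (ref_*). The margin is how much more the policy
    prefers the chosen response than the reference does.
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


if __name__ == "__main__":
    # Policy equal to the reference: zero margin, so the loss is log(2).
    print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The design point is that DPO needs no separately trained reward model or on-policy sampling, while the reward-model route keeps an explicit scorer that can be reused for evaluation and for RL-style optimization.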
#### Clarifying Questions for this Part
- Do we have or can we collect a large pool of human labelers for demonstrations and pairwise preferences, and at what quality bar?
- Is there an existing content policy we must encode, or are we defining the allowed/refused taxonomy ourselves?
### Part 3 — Grounding, freshness, tools, and memory
The base model's knowledge is frozen at training cutoff and cannot contain private data. Design how the assistant stays factual, current, and personalized: retrieval-augmented generation, tool/function calling, and any cross-session memory.
```hint Don't retrain to add a fact
For freshness and private data, retrieve relevant passages at query time and put them in context (RAG) and/or let the model call tools (search, DB, calculator) — reserve retraining for capability, not facts.
```
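The retrieve-then-generate flow can be sketched in a few lines. This toy version assumes a bag-of-words "embedding" and cosine similarity in place of a learned embedding model and a vector index, and it filters by tenant before ranking so one tenant's passages can never reach another tenant's prompt. The document schema, tenant names, and prompt wording are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase token counts (stand-in for a learned encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, docs, tenant_id, k=2):
    """Rank only the caller's own documents, then keep the top k."""
    q = embed(query)
    allowed = [d for d in docs if d["tenant"] == tenant_id]
    ranked = sorted(allowed, key=lambda d: cosine(q, embed(d["text"])), reverse=True)
    return ranked[:k]


def build_prompt(query, passages):
    """Put retrieved passages in context and ask for cited, grounded answers."""
    context = "\n".join(f"[{p['id']}] {p['text']}" for p in passages)
    return (
        "Answer using only the passages below and cite passage ids. "
        "If the passages do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )


if __name__ == "__main__":
    docs = [
        {"id": "a1", "tenant": "acme", "text": "The VPN gateway was migrated to vpn2.acme.example in March."},
        {"id": "a2", "tenant": "acme", "text": "Expense reports are due on the fifth business day of each month."},
        {"id": "b1", "tenant": "globex", "text": "The VPN gateway for Globex is vpn.globex.example."},
    ]
    query = "What is the VPN gateway address?"
    print(build_prompt(query, retrieve(query, docs, tenant_id="acme", k=1)))
```

Updating a fact then means updating the index, not the model weights. Applying the tenant filter before ranking (rather than after generation) is one simple way to reason about isolation.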
### Part 4 — Evaluation, serving, and safety in production
Define how you measure assistant quality offline and online, how you serve streaming responses at scale within budget, and how you keep it safe and monitored in production.
```hint Quality is multi-dimensional and partly subjective
A single accuracy number doesn't capture a chatbot — combine human preference / side-by-side win rates, task-specific benchmarks, groundedness/hallucination checks, and safety red-team pass rates.
```
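Side-by-side win rates are only useful with an uncertainty estimate, since a few hundred human judgments can easily produce a misleading point value. The sketch below computes a win rate with a Wilson score interval. Counting a tie as half a win is one common convention and is an assumption here, as are the function name and the 95% level (`z=1.96`).

```python
import math


def win_rate_with_ci(wins, losses, ties=0, z=1.96):
    """Win rate of model A over model B with a Wilson score interval.

    Ties count as half a win (a convention, not the only option).
    Returns (point_estimate, lower_bound, upper_bound).
    """
    n = wins + losses + ties
    if n == 0:
        raise ValueError("need at least one judgment")
    p = (wins + 0.5 * ties) / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, center - half_width, center + half_width


if __name__ == "__main__":
    p, lo, hi = win_rate_with_ci(wins=60, losses=40)
    print(f"win rate {p:.2f}, 95% interval [{lo:.3f}, {hi:.3f}]")
```

If the lower bound does not clear 0.5, the comparison has not shown that the new model is preferred, and more judgments (or a stratified prompt set) are needed. The same function can be applied to rates such as refusals on benign prompts or red-team pass rates.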
```hint Streaming serving levers
KV caching, continuous batching, paged attention (to cut KV-cache memory fragmentation), quantization, and speculative decoding are the main levers for meeting the first-token and per-token latency budget at high concurrency.
```
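KV-cache memory is usually what caps concurrency, so it helps to estimate it explicitly. The sketch below computes the cache footprint per token (one key and one value vector per layer per KV head) and the number of full-length sequences that fit in a given memory budget. The model shape, context length, and the 40 GB free-memory figure are hypothetical, and the estimate assumes no fragmentation or prefix sharing.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache stored per token.

    The leading factor 2 accounts for storing both a key and a value
    vector for every layer and every KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_concurrent_sequences(free_memory_gb, context_tokens, **model_shape):
    """How many sequences of `context_tokens` fit in the remaining memory."""
    per_sequence = kv_cache_bytes_per_token(**model_shape) * context_tokens
    return int(free_memory_gb * 1e9 // per_sequence)


if __name__ == "__main__":
    # Hypothetical model shape with grouped-query attention (8 KV heads).
    shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)
    per_token = kv_cache_bytes_per_token(**shape)
    print(f"{per_token / 1024:.0f} KiB of KV cache per token")
    print(max_concurrent_sequences(40, 4096, **shape), "sequences at 4096 tokens")
```

Fewer KV heads, lower-precision cache values, and shorter effective contexts each raise the concurrency ceiling. Paged allocation helps the real number approach this estimate by avoiding reserving the full context up front.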
### Follow-up Questions
- Users report the assistant is "too cautious" and refuses benign requests. How do you measure over-refusal and fix it without weakening real safety?
- The model hallucinates citations even with retrieval enabled. How do you diagnose whether the failure is in retrieval, reranking, or generation, and how do you fix it?
- How would you let an enterprise ground answers in its private corpus while guaranteeing that corpus never enters your training set or another tenant's responses?
- Describe how you would A/B test a new alignment recipe in production without exposing users to a regression in safety.
Quick Answer: This question evaluates the ability to design an end-to-end LLM-based conversational assistant, covering pretraining, alignment, retrieval, and serving. It tests understanding of how base language models are turned into safe, helpful chatbots through fine-tuning and preference optimization, and how they stay factual and current through retrieval and tool use. It is commonly asked in ML system design interviews to assess architectural and trade-off reasoning at a practical, applied level.