How do I approach ML System Design interview questions?

ML System Design questions require an understanding of core concepts and practice. PracHub provides solutions with explanations to help you master ML System Design interviews.

What difficulty level is this interview question?

This is a hard difficulty ML System Design question, commonly asked during Onsite rounds at Meta.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at Meta during technical interviews.

Design an LLM-Based Conversational Assistant (Chatbot)

This question evaluates the ability to design an end-to-end LLM-based conversational assistant, covering pretraining, alignment, retrieval, and serving. It tests understanding of how base language models are turned into safe, helpful chatbots through fine-tuning and preference optimization, and how they stay factual and current through retrieval and tool use. It is commonly asked in ML system design interviews to assess architectural and trade-off reasoning at a practical, applied level.

In a research-design round you are asked to design the model and system behind a general-purpose conversational assistant — a multi-turn chatbot that answers questions, holds a coherent dialogue, can stay grounded in fresh or private information, and behaves safely. Cover how you would turn a base language model into a helpful, honest, and harmless assistant, how you keep it grounded and up to date, how you evaluate it, and how you serve it at scale.

Constraints & Assumptions

General-purpose consumer + enterprise assistant; broad-domain questions, multi-turn conversations, some tool use (search, calculator, internal knowledge bases).
Responses are streamed; users expect the first token within a few hundred milliseconds and a fluent token rate after that.
The assistant must avoid clearly harmful outputs (e.g., disallowed content), refuse appropriately, and avoid confidently stating false facts where it can be grounded instead.
The world changes after training cutoff, and some answers depend on private/enterprise data the model never saw.
You have a large pretraining budget and access to standard web-scale text plus the ability to collect human feedback data.
Start from a decoder-only autoregressive transformer base model unless you argue otherwise.

Clarifying Questions to Ask

What is the primary objective — engagement/retention, task success, enterprise deployment, or safety-sensitive use cases — and what is the headline metric?
What is the latency budget (first-token and tokens/sec) and the expected concurrency, since that bounds model size and serving design?
Which capabilities are in scope for v1: pure chat, retrieval-grounded answers, tool/function calling, long-term memory across sessions?
What is the safety bar and who owns the policy (allowed/refused content), and do we need region-specific policies?
Do enterprise customers need their private data grounded without it entering training, and under what isolation guarantees?
How fresh must answers be (real-time, daily, or is the training cutoff fine)? The answer decides whether we need retrieval/tools or whether periodic retraining is enough.
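
The link between concurrency and serving design in the latency question above can be made concrete with a back-of-envelope KV-cache estimate. The sketch below assumes a decoder-only transformer that caches one key and one value vector per layer and per KV head for every token; the model shape and the 40 GB of free accelerator memory are hypothetical numbers chosen only for illustration.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers.

    The leading factor of 2 accounts for storing both a key and a value.
    bytes_per_value=2 assumes fp16/bf16 cache entries.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def max_concurrent_sequences(free_memory_gb: float, tokens_per_sequence: int,
                             bytes_per_token: int) -> int:
    """Upper bound on sequences whose full KV cache fits in free memory."""
    per_sequence = tokens_per_sequence * bytes_per_token
    return int(free_memory_gb * 1e9 // per_sequence)


if __name__ == "__main__":
    # Hypothetical shape: 32 layers, 8 KV heads (grouped-query attention), head_dim 128.
    per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
    print(per_token)                                        # 131072 bytes (128 KiB)
    print(max_concurrent_sequences(40.0, 4096, per_token))  # 74
```

The estimate ignores activation memory and fragmentation, so it is an upper bound; it is mainly useful for seeing how context length and the number of KV heads trade against concurrency.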

Part 1 — Requirements and base model

State the functional and non-functional requirements and choose the base architecture and size strategy. Decide what the base (pretrained) model must provide before any alignment, and how model size trades off against the streaming-latency budget and serving cost.
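
One way to reason about the size versus latency trade-off is a rough bound for single-stream decoding, which is typically limited by memory bandwidth because every generated token requires reading the weights once. The sketch below is an approximation under that assumption; the 2000 GB/s bandwidth figure and the model sizes are hypothetical, and KV-cache reads, kernel overhead, and multi-device communication are ignored.

```python
from dataclasses import dataclass


@dataclass
class ServingEstimate:
    weight_gb: float       # memory needed to hold the weights
    ms_per_token: float    # lower bound on per-token decode latency
    tokens_per_sec: float  # upper bound on single-stream decode rate


def estimate_decode(params_billion: float, bytes_per_param: float,
                    mem_bandwidth_gb_s: float) -> ServingEstimate:
    """Bandwidth-bound estimate at batch size 1.

    time per token >= weight bytes / memory bandwidth
    """
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes / 1e9
    sec_per_token = weight_gb / mem_bandwidth_gb_s
    return ServingEstimate(weight_gb, sec_per_token * 1000.0, 1.0 / sec_per_token)


if __name__ == "__main__":
    # Hypothetical accelerator with 2000 GB/s of memory bandwidth.
    for size in (7, 70):
        for name, nbytes in (("fp16", 2.0), ("int8", 1.0)):
            est = estimate_decode(size, nbytes, 2000.0)
            print(f"{size}B {name}: {est.weight_gb:.0f} GB, "
                  f"{est.ms_per_token:.1f} ms/token, {est.tokens_per_sec:.0f} tok/s")
```

Under these assumptions a tenfold larger model is roughly tenfold slower per token on the same hardware, which is why quantization, sharding across devices, or a smaller model are the usual levers when the token rate misses the budget.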

Part 2 — Alignment: turning a base model into an assistant

Describe the post-training pipeline that makes the base model follow instructions, hold a multi-turn conversation, refuse unsafe requests, and match human preferences. Cover supervised fine-tuning and reinforcement learning from human feedback (or a preference-optimization alternative), and how you build the data for each.
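
As one example of a preference-optimization alternative to RLHF, the Direct Preference Optimization (DPO) objective can be written directly in terms of sequence log-probabilities under the policy and a frozen reference model. The sketch below computes the per-pair loss in plain Python; the function names and the value of `beta` are illustrative assumptions, and a real training loop would compute the log-probabilities with a deep learning framework and average over a batch.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed token log-probability of a full response.
    The loss falls as the policy raises the chosen response relative to the
    reference more than it raises the rejected one.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The reference terms act as an implicit KL anchor: the policy is rewarded only for moving relative to the reference, which is the role the KL penalty plays in PPO-style RLHF.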

Clarifying Questions for this Part

Do we have or can we collect a large pool of human labelers for demonstrations and pairwise preferences, and at what quality bar?
Is there an existing content policy we must encode, or are we defining the allowed/refused taxonomy ourselves?
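
One possible way to make the labeler "quality bar" measurable is chance-corrected agreement between two annotators on the same pairwise comparisons. The sketch below computes Cohen's kappa; the label values "A" and "B" (which response was preferred) are an assumed encoding, and the acceptable threshold is a policy decision the question leaves open.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same class independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical class throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    annotator_1 = ["A", "A", "B", "B"]
    annotator_2 = ["A", "A", "B", "A"]
    print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

Low agreement on a slice of prompts usually signals ambiguous guidelines rather than careless labelers, so it is worth breaking the statistic down by prompt category.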

Part 3 — Grounding, freshness, tools, and memory

The base model's knowledge is frozen at training cutoff and cannot contain private data. Design how the assistant stays factual, current, and personalized: retrieval-augmented generation, tool/function calling, and any cross-session memory.
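
A minimal sketch of the retrieval-augmented path is shown below, assuming a toy bag-of-words similarity in place of a learned embedding model and an in-memory list in place of a vector index. The `Doc` type, the `tenant_id` filter, and the prompt wording are all hypothetical; they illustrate the shape of the flow (filter by tenant, rank, then ground the prompt in cited sources), not a production design.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Doc:
    doc_id: str
    tenant_id: str
    text: str


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would use a learned encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: List[Doc], tenant_id: str, k: int = 2) -> List[Doc]:
    """Rank only the caller's own documents, so other tenants' data is never scored."""
    q = embed(query)
    allowed = [d for d in docs if d.tenant_id == tenant_id]
    ranked = sorted(allowed, key=lambda d: cosine(q, embed(d.text)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: List[Doc]) -> str:
    """Put retrieved passages in the prompt with ids the model can cite."""
    context = "\n".join(f"[{d.doc_id}] {d.text}" for d in passages)
    return ("Answer using only the sources below and cite them by id. "
            "If they do not contain the answer, say so.\n\n"
            f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:")


if __name__ == "__main__":
    corpus = [
        Doc("a1", "acme", "The VPN requires a hardware token for login"),
        Doc("a2", "acme", "Expense reports are due on the fifth of each month"),
        Doc("g1", "globex", "The VPN login requires a hardware token and a password"),
    ]
    hits = retrieve("How do I login to the VPN?", corpus, tenant_id="acme")
    print(build_prompt("How do I login to the VPN?", hits))
```

Applying the tenant filter before ranking, rather than after, means a bug in scoring cannot surface another tenant's passage; the same pattern extends to per-user access control lists.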

Part 4 — Evaluation, serving, and safety in production

Define how you measure assistant quality offline and online, how you serve streaming responses at scale within budget, and how you keep it safe and monitored in production.
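
For offline or online quality measurement, one common approach is pairwise comparison of a candidate model against a baseline, judged by humans or a grader model. The sketch below, which is an assumption about how such judgements might be summarized, reports the candidate's win rate with a Wilson score interval; ties are excluded from the denominator, and the label strings are hypothetical.

```python
import math
from typing import Iterable, Tuple


def wilson_interval(wins: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def win_rate(judgements: Iterable[str]) -> Tuple[float, Tuple[float, float], int]:
    """Summarize judgements labeled "new", "baseline", or "tie".

    Returns (win rate of "new" among decisive comparisons, interval, tie count).
    """
    items = list(judgements)
    wins = items.count("new")
    losses = items.count("baseline")
    ties = items.count("tie")
    decisive = wins + losses
    return wins / decisive, wilson_interval(wins, decisive), ties


if __name__ == "__main__":
    labels = ["new"] * 60 + ["baseline"] * 40 + ["tie"] * 10
    rate, (low, high), ties = win_rate(labels)
    print(f"win rate {rate:.2f}, 95% interval [{low:.3f}, {high:.3f}], ties {ties}")
```

If the lower bound of the interval stays above 0.5, the candidate is preferred with reasonable confidence on this prompt set; whether that prompt set reflects production traffic is a separate question the evaluation design has to answer.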

Follow-up Questions

Users report the assistant is "too cautious" and refuses benign requests. How do you measure over-refusal and fix it without weakening real safety?
The model hallucinates citations even with retrieval enabled. How do you diagnose whether the failure is in retrieval, reranking, or generation, and how do you fix it?
How would you let an enterprise ground answers in its private corpus while guaranteeing that corpus never enters your training set or another tenant's responses?
Describe how you would A/B test a new alignment recipe in production without exposing users to a regression in safety.
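
For the over-refusal follow-up, one starting point is to track two rates on a labeled evaluation set: refusals on benign prompts (over-refusal) and compliance on harmful prompts (under-refusal). The sketch below assumes each example already carries a "benign" or "harmful" label and a refusal flag from a human or classifier; the `safe_to_ship` gate and its tolerance are hypothetical, illustrating the idea that a fix must lower over-refusal without raising under-refusal.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    prompt_label: str  # "benign" or "harmful", from the labeled eval set
    refused: bool      # whether the assistant refused this prompt


@dataclass
class RefusalRates:
    over_refusal: float   # share of benign prompts that were refused
    under_refusal: float  # share of harmful prompts that were answered


def refusal_rates(examples: List[Example]) -> RefusalRates:
    benign = [e for e in examples if e.prompt_label == "benign"]
    harmful = [e for e in examples if e.prompt_label == "harmful"]
    if not benign or not harmful:
        raise ValueError("eval set needs both benign and harmful prompts")
    over = sum(e.refused for e in benign) / len(benign)
    under = sum(not e.refused for e in harmful) / len(harmful)
    return RefusalRates(over, under)


def safe_to_ship(baseline: RefusalRates, candidate: RefusalRates,
                 tolerance: float = 0.0) -> bool:
    """Accept only if over-refusal drops and under-refusal does not regress."""
    return (candidate.over_refusal < baseline.over_refusal
            and candidate.under_refusal <= baseline.under_refusal + tolerance)


if __name__ == "__main__":
    baseline = RefusalRates(over_refusal=0.30, under_refusal=0.02)
    fixed = RefusalRates(over_refusal=0.10, under_refusal=0.02)
    weakened = RefusalRates(over_refusal=0.05, under_refusal=0.15)
    print(safe_to_ship(baseline, fixed))     # True
    print(safe_to_ship(baseline, weakened))  # False
```

The same two-sided gate can serve as a guardrail metric when A/B testing a new alignment recipe: the experiment is halted if under-refusal on a held-out harmful set regresses, regardless of gains elsewhere.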
