How do I approach ML System Design interview questions?

ML System Design questions require a solid grasp of core concepts and regular practice. PracHub provides solutions with explanations to help you master ML system design interviews.

What difficulty level is this interview question?

This is a hard-difficulty ML System Design question, commonly asked during onsite rounds at Meta.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at Meta during technical interviews.

Design an LLM-Based Coding Assistant | Meta Interview Question

This question evaluates a candidate's ability to design an end-to-end machine learning system, covering model architecture, training data pipelines, evaluation methodology, and inference serving under latency constraints. It tests ML system design skills at a practical, applied level and is commonly used to assess how candidates balance competing requirements such as speed, cost, and quality in production ML systems.

Design an LLM-Based Coding Assistant

You are asked, in a research-design round, to design the model and end-to-end system behind a large-language-model coding assistant. The product spans two surfaces: (1) low-latency inline code completion inside an IDE (including fill-in-the-middle, where the cursor sits between existing code), and (2) a chat / agentic mode that can read a repository, edit multiple files, run tools, and iterate until a task is done.

Walk through how you would build the model that powers this product: the data and training recipe, how you would evaluate it, and how you would serve it under real latency budgets. Justify the major decisions and the trade-offs.

Constraints & Assumptions

Target users: professional software engineers working in large, multi-language repositories.
Two latency regimes: inline completion needs a first token in roughly tens of milliseconds to a few hundred ms; agentic / chat mode can tolerate seconds of latency but performs many sequential model calls per task.
Languages: a long tail (Python, JS/TS, Java, Go, C++, SQL, etc.), with a few head languages dominating traffic.
Context: the model must use surrounding file content, other open files, and ideally repository-level context; effective context lengths of tens of thousands of tokens are expected.
You have a large pretraining compute budget and access to public code plus internal/customer data that can be obtained with permission (with the legal and privacy constraints that implies).
Assume an autoregressive decoder-only transformer family is the starting point unless you argue otherwise.

Clarifying Questions to Ask

What is the primary success metric the business cares about — completion acceptance rate, task-completion rate on agentic tasks, retention, or paid conversion?
What is the hard latency budget (and percentile) for inline completion, and what model-size envelope does that imply on our serving hardware?
Which capabilities are in scope for v1: single-line completion only, multi-line / whole-function, fill-in-the-middle, repository-aware edits, tool use / terminal execution?
What are the licensing, privacy, and data-retention constraints on training data and on customer code at inference time?
Do we need on-prem / VPC deployment for enterprise customers, and does that cap model size or preclude certain serving tricks?
What is the acceptable rate of insecure or licensed-code regurgitation, and who owns that safety bar?

Part 1 — Capabilities, requirements, and model family

Define the functional and non-functional requirements and pick the model architecture and a size strategy. In particular, decide how you support fill-in-the-middle (FIM), long repository context, and the two latency regimes. Argue for one model vs. a family of models of different sizes.
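As a point of reference for the FIM decision, here is a minimal sketch of one common way to support infilling with a decoder-only model: reorder each training document as prefix, suffix, middle, separated by sentinel markers, so that ordinary left-to-right generation produces the missing middle. The sentinel strings and function names below are hypothetical placeholders, not the tokens of any particular model.

```python
import random

# Hypothetical sentinel strings; a real tokenizer would reserve its own special tokens.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def to_fim_example(document: str, rng: random.Random) -> str:
    """Cut a document into prefix | middle | suffix at two random points and
    emit it in prefix, suffix, middle order for left-to-right training."""
    a, b = sorted(rng.randint(0, len(document)) for _ in range(2))
    prefix, middle, suffix = document[:a], document[a:b], document[b:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


def build_infill_prompt(prefix: str, suffix: str) -> str:
    """At inference time the editor supplies the text before and after the
    cursor; the model continues after the final sentinel with the middle."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"
```

In a design discussion, the open choices around this sketch are what fraction of training documents receive the transformation and whether cut points fall on character, token, or syntax boundaries.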

Part 2 — Data and training recipe

Describe the full training pipeline: pretraining corpus construction, the code-specific objective(s), and post-training (instruction tuning + reinforcement learning) that turns a base model into an assistant that can follow instructions, do FIM, and complete agentic coding tasks.
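To make the reinforcement-learning step concrete, the following is a minimal sketch of a test-based reward, assuming Python-only tasks where each task ships assertion-style tests. The function name `unit_test_reward` is hypothetical, and a plain subprocess is not a real sandbox: a production pipeline would need process isolation, resource limits, and no network access.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def unit_test_reward(candidate_code: str, test_code: str, timeout_s: float = 10.0) -> float:
    """Return 1.0 if the tests run to completion against the candidate, else 0.0.

    The candidate and its tests are written to one script and executed in a
    child interpreter; a non-zero exit code or a timeout counts as failure.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "run_case.py"
        script.write_text(candidate_code + "\n\n" + test_code)
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=tmp,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0
    return 1.0 if result.returncode == 0 else 0.0
```

A binary pass/fail signal like this is easy to game, for example by special-casing the visible test inputs, so it is usually worth discussing held-out tests and additional quality signals alongside it.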

Clarifying Questions for this Part

Are we allowed to train on customer repositories, and if so under what opt-in / isolation guarantees?
Do we have an execution sandbox at training time to run unit tests for RL, and what languages does it support?

Part 3 — Evaluation

Define how you measure quality before launch (offline) and after launch (online). Address why simple benchmarks are insufficient and how you guard against contamination and against optimizing a proxy that diverges from real usefulness.
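One offline metric that often comes up for code generation is pass@k: the probability that at least one of k sampled solutions passes the tests. The sketch below uses the widely used unbiased estimator, computed from n samples per problem of which c pass; treating it as the headline metric is an assumption for illustration, not something the question prescribes.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: budget (k <= n).
    Computes 1 - C(n - c, k) / C(n, k), the chance that a random size-k
    subset of the n samples contains at least one passing sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every size-k subset has a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score is the mean of this value over problems. It says nothing about contamination or real-world usefulness, which is why the question asks for more than a benchmark number.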

Part 4 — Serving and inference

Describe how you serve both regimes within budget: how you hit the inline-completion latency target, how you keep agentic mode cost-effective despite many sequential calls, and how you handle long repository context efficiently.
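A quick back-of-the-envelope calculation helps connect the latency budget to a model-size envelope. The sketch below assumes single-request decoding is limited by memory bandwidth, so each generated token costs at least one full read of the weights. The function name and the hardware figures in the usage note are hypothetical; the estimate ignores KV-cache reads, prefill compute for the first token, batching, and network time.

```python
def decode_ms_per_token(params_billion: float, bytes_per_param: float,
                        mem_bandwidth_gb_s: float) -> float:
    """Rough lower bound on per-token decode latency at batch size 1.

    Assumes every token requires streaming all weights from accelerator
    memory once: time = weight bytes / memory bandwidth.
    """
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes/param = GB
    return weight_gb / mem_bandwidth_gb_s * 1000.0
```

With an assumed 2,000 GB/s of memory bandwidth and 2-byte weights, a 7B-parameter model comes out at about 7 ms per token and a 70B-parameter model at about 70 ms per token, which is one way to motivate a small model for inline completion and a larger one for chat / agentic mode.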

Follow-up Questions

Inline acceptance rate is up but engineers report the assistant "feels worse" on hard tasks. How do you diagnose whether this is a model, retrieval, ranking, or evaluation problem?
A customer demands an on-prem deployment that cannot fit your chat model on their hardware. How do you preserve quality under that constraint?
How would you detect and prevent verbatim regurgitation of GPL-licensed or secret-containing training code at inference time?
Your RL-from-tests pipeline starts producing code that passes the tests but is unreadable or reward-hacks the harness. How do you fix the reward?
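For the regurgitation follow-up, one possible starting point is an n-gram overlap check between generated code and an index of protected snippets. This is a minimal sketch under simplifying assumptions: whitespace tokenization, an in-memory set as the index, and hypothetical function names. A real system would need a normalizing tokenizer and a scalable index.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def token_ngrams(tokens: List[str], n: int) -> Set[NGram]:
    """All contiguous n-token windows of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(protected_snippets: Iterable[str], n: int) -> Set[NGram]:
    """Union of n-grams over every protected snippet (whitespace-tokenized)."""
    index: Set[NGram] = set()
    for snippet in protected_snippets:
        index |= token_ngrams(snippet.split(), n)
    return index


def overlap_fraction(generated: str, index: Set[NGram], n: int) -> float:
    """Fraction of the generation's n-grams that also appear in the index."""
    grams = token_ngrams(generated.split(), n)
    if not grams:
        return 0.0  # generation shorter than n tokens
    return len(grams & index) / len(grams)
```

A completion whose overlap fraction exceeds a chosen threshold could be blocked or flagged; the threshold and the value of n are tuning decisions that trade false positives on common idioms against missed matches.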
