Design an LLM-Based Coding Assistant
Company: Meta
Role: Machine Learning Engineer
Category: ML System Design
Difficulty: hard
Interview Round: Onsite
## Design an LLM-Based Coding Assistant
You are asked, in a research-design round, to design the model and end-to-end system behind a large-language-model **coding assistant**. The product spans two surfaces: (1) low-latency inline code completion inside an IDE (including fill-in-the-middle, where the cursor sits between existing code), and (2) a chat / agentic mode that can read a repository, edit multiple files, run tools, and iterate until a task is done.
Walk through how you would build the model that powers this product: the data and training recipe, how you would evaluate it, and how you would serve it under real latency budgets. Justify the major decisions and the trade-offs.
### Constraints & Assumptions
- Target users: professional software engineers working in large, multi-language repositories.
- Two latency regimes: inline completion needs a first token within roughly **tens to a few hundred milliseconds**; agentic / chat mode can tolerate seconds of latency per call but makes many sequential model calls per task.
- Languages: a long tail (Python, JS/TS, Java, Go, C++, SQL, etc.), with a few head languages dominating traffic.
- Context: the model must use surrounding file content, other open files, and ideally repository-level context; effective context lengths of tens of thousands of tokens are expected.
- You have a large pretraining compute budget and access to public code plus internal/customer data that can be obtained with permission (with the legal and privacy constraints that implies).
- Assume an autoregressive decoder-only transformer family is the starting point unless you argue otherwise.
### Clarifying Questions to Ask
- What is the primary success metric the business cares about — completion acceptance rate, task-completion rate on agentic tasks, retention, or paid conversion?
- What is the hard latency budget (and at which percentile) for inline completion, and what model-size envelope does that imply on our serving hardware?
- Which capabilities are in scope for v1: single-line completion only, multi-line / whole-function, fill-in-the-middle, repository-aware edits, tool use / terminal execution?
- What are the licensing, privacy, and data-retention constraints on training data and on customer code at inference time?
- Do we need on-prem / VPC deployment for enterprise customers, and does that cap model size or preclude certain serving tricks?
- What is the acceptable rate of insecure suggestions and of verbatim regurgitation of licensed code, and who owns that safety bar?
### Part 1 — Capabilities, requirements, and model family
Define the functional and non-functional requirements and pick the model architecture and a size strategy. In particular, decide how you support fill-in-the-middle (FIM), long repository context, and the two latency regimes. Argue for one model vs. a family of models of different sizes.
```hint Two regimes, possibly two models
The inline-completion latency budget and the agentic-quality budget pull in opposite directions on model size — consider a small fast model for inline and a larger model for chat/agentic, and what they can share (tokenizer, training data).
```
```hint Fill-in-the-middle
FIM is a data/objective transformation (split a document into prefix / middle / suffix and reorder with sentinel tokens), not a new architecture — decide the FIM rate during pretraining.
```
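The FIM transformation described in the hint can be sketched in a few lines. This is a minimal character-level illustration in prefix-suffix-middle order; the sentinel strings, the `to_fim` / `from_fim` names, and the default `fim_rate` of 0.5 are assumptions made for the example, not values given in the question.

```python
import random

# Hypothetical sentinel strings; a real tokenizer would reserve its own special tokens.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"


def to_fim(doc: str, rng: random.Random, fim_rate: float = 0.5) -> str:
    """With probability fim_rate, reorder doc as prefix, suffix, middle.

    Otherwise the document is returned unchanged and trains as ordinary
    left-to-right text. fim_rate is the knob the hint asks you to choose.
    """
    if len(doc) < 2 or rng.random() >= fim_rate:
        return doc
    # Two random cut points split the document into three spans.
    lo, hi = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:lo], doc[lo:hi], doc[hi:]
    # The middle goes last, so plain next-token prediction learns to infill it.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


def from_fim(example: str) -> str:
    """Invert to_fim; useful as a sanity check that no text was lost."""
    if not example.startswith(FIM_PREFIX):
        return example
    body = example[len(FIM_PREFIX):]
    prefix, rest = body.split(FIM_SUFFIX, 1)
    suffix, middle = rest.split(FIM_MIDDLE, 1)
    return prefix + middle + suffix
```

At inference time the IDE would send the text before the cursor as the prefix and the text after it as the suffix, then let the model generate after the middle sentinel.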
### Part 2 — Data and training recipe
Describe the full training pipeline: pretraining corpus construction, the code-specific objective(s), and post-training (instruction tuning + reinforcement learning) that turns a base model into an assistant that can follow instructions, do FIM, and complete agentic coding tasks.
```hint Stages
Think in stages: large-scale pretraining (code + some natural language + math) → continued / mid-training to up-weight high-quality code and long context → supervised fine-tuning on instructions and edits → reinforcement learning.
```
```hint Where RL shines for code
Code has a cheap, objective reward signal that most domains lack — executing the code against tests. Use it (execution-feedback / RL-from-verifiable-rewards) rather than relying only on a learned preference model.
```
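As a rough illustration of the execution-based reward in the hint above, the sketch below runs a candidate solution together with its tests in a subprocess and returns a binary reward. The function name `execution_reward` and the 5-second timeout are assumptions for the example. A plain subprocess is not a security sandbox, so a real pipeline would need proper isolation (containers, resource limits, no network).

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def execution_reward(candidate: str, tests: str, timeout_s: float = 5.0) -> float:
    """Return 1.0 if candidate code plus its tests exit cleanly, else 0.0.

    Illustrative only: this provides no isolation from the host machine.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "run_candidate.py"
        script.write_text(candidate + "\n\n" + tests + "\n")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return 0.0  # hangs and infinite loops earn no reward
    return 1.0 if proc.returncode == 0 else 0.0
```

A binary pass/fail signal like this is easy to game (for example by special-casing the visible test inputs), which is why hidden tests and additional reward terms usually accompany it.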
```hint Data hygiene
Aggressive dedup (exact + near-dup), decontamination against eval benchmarks, license filtering, and secret/PII scrubbing materially change both quality and legal risk.
```
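Two of the hygiene steps above, near-duplicate removal and benchmark decontamination, can both be expressed as token n-gram overlap. The sketch below is a deliberately simple pairwise version; the helper names, the 0.8 Jaccard threshold, and the n-gram sizes are assumptions for illustration. At corpus scale the pairwise comparison would typically be replaced by MinHash with locality-sensitive hashing.

```python
import re


def shingles(text: str, n: int = 5) -> set:
    """Set of token n-grams; whitespace is ignored so reformatted copies match."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def dedup(docs: list, threshold: float = 0.8) -> list:
    """Keep a document only if it is not a near-duplicate of one already kept.

    O(n^2) comparisons: fine for a sketch, too slow for a real corpus.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


def is_contaminated(doc: str, benchmark_items: list, n: int = 8) -> bool:
    """Flag a training document that shares any token n-gram with an eval item."""
    bench = set()
    for item in benchmark_items:
        bench |= shingles(item, n)
    return not shingles(doc, n).isdisjoint(bench)
```

Exact-duplicate removal (hashing the normalized file contents) would normally run first, since it is far cheaper than any similarity search.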
#### Clarifying Questions for this Part
- Are we allowed to train on customer repositories, and if so under what opt-in / isolation guarantees?
- Do we have an execution sandbox at training time to run unit tests for RL, and what languages does it support?
### Part 3 — Evaluation
Define how you measure quality before launch (offline) and after launch (online). Address why simple benchmarks are insufficient and how you guard against contamination and against optimizing a proxy that diverges from real usefulness.
```hint Execution beats string match
For code, exact-match against a reference is misleading; functional correctness via test execution (pass@k) is the right offline signal, and you need held-out, contamination-checked tasks closer to real repo work than toy puzzles.
```
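For reference, pass@k is usually reported with an unbiased estimator rather than by literally drawing k samples: generate n ≥ k samples per problem, count the c that pass the tests, and compute 1 − C(n−c, k) / C(n, k). A minimal sketch follows; the function name and argument checks are my own choices.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed.

    It is the probability that a random size-k subset of the n samples
    contains at least one passing sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failures to fill a subset with only failures
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark-level score is the mean of this value over all problems. For k = 1 it reduces to c / n, the plain fraction of samples that pass.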
### Part 4 — Serving and inference
Describe how you serve both regimes within budget: how you hit the inline-completion latency target, how you keep agentic mode cost-effective despite many sequential calls, and how you handle long repository context efficiently.
```hint Latency levers
KV-cache reuse / prompt caching across keystrokes, speculative decoding, quantization, continuous batching, and truncating/streaming the context are the standard levers — map each to which regime it helps.
```
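Of the levers listed, speculative decoding is the least obvious, so here is a toy sketch of its greedy variant. The two models are stand-in callables (`draft_next`, `target_next`) that map a token list to the next token; both names and the function signature are hypothetical. In a real system the target model verifies all drafted positions in one batched forward pass instead of the Python loop shown, and sampling-based decoding needs a rejection-sampling acceptance rule that is not shown here.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # toy stand-in for a model's greedy next token


def speculative_step(target_next: NextToken, draft_next: NextToken,
                     tokens: List[int], k: int) -> List[int]:
    """One round of greedy speculative decoding; returns the newly accepted tokens.

    The result always equals what greedy decoding with the target alone
    would produce, so the draft model affects speed, not output.
    """
    # 1) The cheap draft model proposes k tokens autoregressively.
    ctx, proposal = list(tokens), []
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    # 2) The target checks each proposed position in order.
    ctx, accepted = list(tokens), []
    for t in proposal:
        expected = target_next(ctx)
        if expected != t:
            accepted.append(expected)  # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(t)
        ctx.append(t)

    # 3) Every draft token matched; the target contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted
```

Each round yields between 1 and k + 1 tokens for roughly one target-model pass, so the gain depends on how often the draft agrees with the target. That agreement tends to be high for boilerplate-heavy code, which makes this lever attractive for the inline regime.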
```hint Long context is a memory problem
The KV cache grows linearly with context length; prefix caching of unchanged repo context and retrieval of only the relevant files keep cost bounded.
```
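A quick back-of-the-envelope calculation makes the memory pressure concrete. The model shape below (32 layers, 8 KV heads with grouped-query attention, head dimension 128, fp16 cache) is a hypothetical configuration chosen for round numbers, not one specified in the question.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2, batch_size: int = 1) -> int:
    """Size of the KV cache: one key and one value vector per layer, KV head, and token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * seq_len * batch_size


# Hypothetical model shape, for illustration only.
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_768)
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=32_768)

print(f"GQA, 8 KV heads:  {gqa / 2**30:.0f} GiB per 32k-token sequence")
print(f"MHA, 32 KV heads: {mha / 2**30:.0f} GiB per 32k-token sequence")
```

Under these assumptions one 32k-token request holds 4 GiB of cache with grouped-query attention and 16 GiB with full multi-head attention, which is why the number of KV heads, cache quantization, and how much repository context is actually sent all matter for batch size and cost.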
### Follow-up Questions
- Inline acceptance rate is up but engineers report the assistant "feels worse" on hard tasks. How do you diagnose whether this is a model, retrieval, ranking, or evaluation problem?
- A customer demands an on-prem deployment that cannot fit your chat model on their hardware. How do you preserve quality under that constraint?
- How would you detect and prevent verbatim regurgitation of GPL-licensed or secret-containing training code at inference time?
- Your RL-from-tests pipeline starts producing code that passes the tests but is unreadable or reward-hacks the harness. How do you fix the reward?
Quick Answer: This question evaluates a candidate's ability to design an end-to-end machine learning system, covering model architecture, training data pipelines, evaluation methodology, and inference serving under latency constraints. It tests ML system design skills at a practical, applied level and is commonly used to assess how candidates balance competing requirements such as speed, cost, and quality in production ML systems.