PracHub
Design an LLM-Based Coding Assistant

Last updated: Jul 1, 2026

Quick Overview

This question evaluates a candidate's ability to design an end-to-end machine learning system, covering model architecture, training data pipelines, evaluation methodology, and inference serving under latency constraints. It tests ML system design skills at a practical, applied level and is commonly used to assess how candidates balance competing requirements such as speed, cost, and quality in production ML systems.

Company: Meta

Role: Machine Learning Engineer

Category: ML System Design

Difficulty: hard

Interview Round: Onsite

## Design an LLM-Based Coding Assistant

You are asked, in a research-design round, to design the model and end-to-end system behind a large-language-model **coding assistant**. The product spans two surfaces: (1) low-latency inline code completion inside an IDE (including fill-in-the-middle, where the cursor sits between existing code), and (2) a chat / agentic mode that can read a repository, edit multiple files, run tools, and iterate until a task is done.

Walk through how you would build the model that powers this product: the data and training recipe, how you would evaluate it, and how you would serve it under real latency budgets. Justify the major decisions and the trade-offs.

### Constraints & Assumptions

- Target users: professional software engineers working in large, multi-language repositories.
- Two latency regimes: inline completion needs a first token in roughly **tens of milliseconds to a few hundred ms**; agentic / chat mode can tolerate seconds of latency but performs many sequential model calls per task.
- Languages: a long tail (Python, JS/TS, Java, Go, C++, SQL, etc.), with a few head languages dominating traffic.
- Context: the model must use surrounding file content, other open files, and ideally repository-level context; effective context lengths of tens of thousands of tokens are expected.
- You have a large pretraining compute budget and access to public code plus permissively obtainable internal/customer data (with the legal and privacy constraints that implies).
- Assume an autoregressive decoder-only transformer family is the starting point unless you argue otherwise.

### Clarifying Questions to Ask

- What is the primary success metric the business cares about: completion acceptance rate, task-completion rate on agentic tasks, retention, or paid conversion?
- What is the hard latency budget (and percentile) for inline completion, and what model-size envelope does that imply on our serving hardware?
- Which capabilities are in scope for v1: single-line completion only, multi-line / whole-function, fill-in-the-middle, repository-aware edits, tool use / terminal execution?
- What are the licensing, privacy, and data-retention constraints on training data and on customer code at inference time?
- Do we need on-prem / VPC deployment for enterprise customers, and does that cap model size or preclude certain serving tricks?
- What is the acceptable rate of insecure or licensed-code regurgitation, and who owns that safety bar?

### Part 1: Capabilities, requirements, and model family

Define the functional and non-functional requirements and pick the model architecture and a size strategy. In particular, decide how you support fill-in-the-middle (FIM), long repository context, and the two latency regimes. Argue for one model vs. a family of models of different sizes.

```hint Two regimes, possibly two models
The inline-completion latency budget and the agentic-quality budget pull in opposite directions on model size; consider a small fast model for inline and a larger model for chat/agentic, and what they can share (tokenizer, training data).
```

```hint Fill-in-the-middle
FIM is a data/objective transformation (split a document into prefix / middle / suffix and reorder with sentinel tokens), not a new architecture; decide the FIM rate during pretraining.
```

### Part 2: Data and training recipe

Describe the full training pipeline: pretraining corpus construction, the code-specific objective(s), and post-training (instruction tuning + reinforcement learning) that turns a base model into an assistant that can follow instructions, do FIM, and complete agentic coding tasks.

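
As an illustration of the fill-in-the-middle objective as a pure data transformation, here is a minimal sketch. It assumes character-level cut points, a prefix-suffix-middle ordering, and a default FIM rate of 0.5; the sentinel strings and both function names are hypothetical placeholders, not any particular tokenizer's vocabulary or a confirmed recipe.

```python
import random
from typing import Tuple

# Hypothetical sentinel strings; a real tokenizer would reserve dedicated special tokens.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def to_fim_example(document: str, rng: random.Random, fim_rate: float = 0.5) -> str:
    """Return the document unchanged, or rewritten in prefix-suffix-middle order."""
    # Leave a share of documents as plain left-to-right text (assumed rate, tune it).
    if len(document) < 2 or rng.random() >= fim_rate:
        return document
    # Two distinct random cut points split the text into three spans.
    lo, hi = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:lo], document[lo:hi], document[hi:]
    # The model reads prefix and suffix first, then is trained to emit the middle.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


def split_fim_example(example: str) -> Tuple[str, str, str]:
    """Recover (prefix, middle, suffix) from a transformed example."""
    body = example[len(FIM_PREFIX):]
    prefix, rest = body.split(FIM_SUFFIX, 1)
    suffix, middle = rest.split(FIM_MIDDLE, 1)
    return prefix, middle, suffix
```

Because the transformation only reorders text, the same decoder-only architecture and next-token loss apply; at inference time the IDE would send the text before and after the cursor as the prefix and suffix.
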
```hint Stages
Think in stages: large-scale pretraining (code + some natural language + math) → continued / mid-training to up-weight high-quality code and long context → supervised fine-tuning on instructions and edits → reinforcement learning.
```

```hint Where RL shines for code
Code has a cheap, objective reward signal that most domains lack: executing the code against tests. Use it (execution-feedback / RL-from-verifiable-rewards) rather than relying only on a learned preference model.
```

```hint Data hygiene
Aggressive dedup (exact + near-dup), decontamination against eval benchmarks, license filtering, and secret/PII scrubbing materially change both quality and legal risk.
```

#### Clarifying Questions for this Part

- Are we allowed to train on customer repositories, and if so under what opt-in / isolation guarantees?
- Do we have an execution sandbox at training time to run unit tests for RL, and what languages does it support?

### Part 3: Evaluation

Define how you measure quality before launch (offline) and after launch (online). Address why simple benchmarks are insufficient and how you guard against contamination and against optimizing a proxy that diverges from real usefulness.

```hint Execution beats string match
For code, exact-match against a reference is misleading; functional correctness via test execution (pass@k) is the right offline signal, and you need held-out, contamination-checked tasks closer to real repo work than toy puzzles.
```

### Part 4: Serving and inference

Describe how you serve both regimes within budget: how you hit the inline-completion latency target, how you keep agentic mode cost-effective despite many sequential calls, and how you handle long repository context efficiently.

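
To make the cost of long repository context concrete, the following back-of-the-envelope sketch estimates key/value cache memory per request and the prefill work saved when a prompt prefix is already cached. The model dimensions are hypothetical, chosen only to give round numbers, and `ModelShape`, `kv_cache_bytes`, and `uncached_tokens` are illustrative names rather than any serving framework's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    # All values passed in below are hypothetical; substitute the real model config.
    n_layers: int
    n_kv_heads: int       # fewer than the query heads under grouped-query attention
    head_dim: int
    bytes_per_value: int  # 2 for fp16/bf16, 1 for an 8-bit cache


def kv_cache_bytes(shape: ModelShape, seq_len: int, batch_size: int = 1) -> int:
    """Bytes needed to hold keys and values for every layer and cached token."""
    # Factor of 2: one key vector and one value vector per layer, head, and token.
    per_token = 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * shape.bytes_per_value
    return per_token * seq_len * batch_size


def uncached_tokens(prompt_len: int, cached_prefix_len: int) -> int:
    """Tokens that still need prefill when a prompt prefix is already cached."""
    return prompt_len - min(cached_prefix_len, prompt_len)


gqa = ModelShape(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2)
mha = ModelShape(n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_value=2)

print(kv_cache_bytes(gqa, seq_len=32_768) / 2**30)  # 4.0 GiB per request
print(kv_cache_bytes(mha, seq_len=32_768) / 2**30)  # 16.0 GiB per request
print(uncached_tokens(prompt_len=32_768, cached_prefix_len=32_000))  # 768
```

Under these assumed shapes, a single 32K-token request costs gigabytes of cache, which is why the number of KV heads, the cache precision, and reuse of an unchanged prefix across keystrokes or agent steps all bear directly on how many requests fit on one accelerator.
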
```hint Latency levers
KV-cache reuse / prompt caching across keystrokes, speculative decoding, quantization, continuous batching, and truncating/streaming the context are the standard levers; map each to which regime it helps.
```

```hint Long context is a memory problem
The KV cache grows with context length; prefix caching of unchanged repo context and retrieval of only the relevant files keep cost bounded.
```

### Follow-up Questions

- Inline acceptance rate is up but engineers report the assistant "feels worse" on hard tasks. How do you diagnose whether this is a model, retrieval, ranking, or evaluation problem?
- A customer demands an on-prem deployment that cannot fit your chat model on their hardware. How do you preserve quality under that constraint?
- How would you detect and prevent verbatim regurgitation of GPL-licensed or secret-containing training code at inference time?
- Your RL-from-tests pipeline starts producing code that passes the tests but is unreadable or reward-hacks the harness. How do you fix the reward?
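
Both the Part 3 hint and the RL-from-tests follow-up rest on executing code against tests. As a reference point, here is a minimal sketch of the commonly used unbiased pass@k estimator: for a task with `n` sampled solutions of which `c` pass, it computes the probability that at least one of `k` samples drawn without replacement passes. The function names and the per-task `(n, c)` input format are assumptions made for illustration.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled solutions, c of which passed the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # contains no passing sample; math.comb returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks, given one (n, c) pair per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

A score computed this way is only as trustworthy as the tests behind it, so weak or leaked test suites inflate it in the same way they invite reward hacking during RL.
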

Jun 27, 2026, 12:00 AM