Design an LLM Agent System That Automatically Resolves Jira Tickets and Opens Pull Requests
Company: Datadog
Role: Software Engineer
Category: ML System Design
Difficulty: medium
Interview Round: Technical Screen
## Design an LLM Agent System That Automatically Resolves Jira Tickets and Opens Pull Requests
You are asked to design an autonomous LLM-agent system that ingests an engineering Jira ticket (bug report or small feature request), understands the relevant codebase, implements a fix, and opens a pull request (PR) for human review. The agent operates over a real production code repository, so correctness, safety, and the ability to ground its reasoning in the actual code are paramount.
The interview emphasizes three subsystems in depth:
1. **Retrieval-Augmented Generation (RAG)** over the codebase and supporting context (the ticket, related tickets, docs, prior PRs), so the agent's edits are grounded in the real repository rather than hallucinated.
2. **Tool access via a Model Context Protocol (MCP)–style integration layer** — the structured, permissioned interface through which the agent reads files, runs searches, calls the Jira API, runs tests, and creates the PR.
3. **A sandboxed execution environment** in which the agent can edit code, run builds/tests, and iterate, without endangering production systems or leaking secrets.
Design the end-to-end system, then go deep on these three subsystems: how you build and query the RAG index, how you structure and secure the MCP tool layer, and how you isolate and control the sandbox.
```hint Frame it as a closed loop
This is an agentic loop, not a single LLM call: **retrieve context → plan → act via tools → execute in sandbox → observe test/build results → reflect → repeat → open PR**. Most of the design quality lives in the *grounding* (RAG), the *action interface* (MCP tools), and the *verification* (sandbox tests), not in the prompt.
```
```hint Anchor every decision to a failure mode
For each subsystem, name the concrete failure it prevents: RAG prevents *hallucinated symbols / wrong file edits*; MCP prevents *unscoped, dangerous, or non-reproducible actions*; the sandbox prevents *prod damage, secret exfiltration, and unverifiable diffs*. Tie components to those, not to buzzwords.
```
### Constraints & Assumptions
- **Repository scale:** a mid-to-large monorepo, on the order of $10^4$–$10^6$ files and tens of millions of lines, multiple languages. Full source cannot fit in a model context window.
- **Ticket volume:** assume a few hundred eligible tickets per day. Processing is asynchronous rather than interactive: the latency target is minutes to tens of minutes per ticket, not sub-second.
- **Eligible tickets:** scoped bugs and small features where a fix is plausibly a localized diff (a handful of files). Large refactors / architectural changes are routed to humans.
- **Human-in-the-loop:** the agent never merges. It opens a PR; a human reviews and merges. The agent may push follow-up commits in response to review or CI feedback.
- **Safety:** the agent must never touch production infrastructure, must never exfiltrate secrets, and all code execution happens in an isolated sandbox. Build/test must pass before a PR is opened.
- **Models:** assume access to a strong general-purpose LLM with tool-calling, plus a smaller/cheaper model and an embedding model. Token budgets and per-ticket cost matter.
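As a rough sizing check, Little's law turns the volume and latency assumptions into an average number of concurrent sandboxes. The specific numbers here (300 tickets per day, a 20-minute mean run) are illustrative picks from the ranges above, not given requirements:

$$
L = \lambda W = \frac{300\ \text{tickets}}{1440\ \text{min}} \times 20\ \text{min} \approx 4.2\ \text{concurrent runs}
$$

Tickets tend to cluster in working hours, so capacity would plausibly be provisioned for the peak arrival rate rather than for this daily mean.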
### Clarifying Questions to Ask
- **Scope of autonomy:** Should the agent only open PRs for a triaged subset of tickets (e.g., labeled `agent-eligible`, low-risk), or attempt everything and self-abstain? Who owns the abstain/escalation decision?
- **Definition of success:** Is the target metric "PR opened," "PR that passes CI," "PR merged by a human with minimal edits," or "ticket actually resolved in production"? This changes evaluation and gating.
- **Repository access model:** Do we get a full clone per ticket, a persistent indexed mirror, or read-only API access? How fresh must the index be relative to `main`?
- **Languages and build systems:** One language/build or many? Are there reliable test suites and a deterministic build we can run in the sandbox?
- **Tool surface and permissions:** Which external systems may the agent call (Jira read/write, GitHub/GitLab, CI, internal services), and what is explicitly forbidden (deploys, prod DBs, secret stores)?
- **Secrets and data sensitivity:** Does the repo contain secrets or regulated data? Can code/snippets be sent to the model provider, or must we use a self-hosted / VPC model?
### Part 1 — End-to-End Architecture and the Agentic Loop
Lay out the full system from "a Jira ticket arrives" to "a PR is open and linked back on the ticket." Define the major components and how a single ticket flows through them, including where the agent decides to abstain/escalate.
```hint Components to name
Ticket ingestion + eligibility gate → orchestration/agent runtime (the planner/executor loop with state) → RAG/context service → MCP tool layer → sandbox executor → PR/VCS integration → human review + feedback loop. Add a memory/state store so a run can resume and so the agent can reflect across loop iterations.
```
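The state store can be as small as one append-only record per run. A minimal sketch, assuming a hypothetical `RunState` shape (the field names, status values, and the example ticket key are illustrative and not taken from any real schema):

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class RunState:
    """Hypothetical per-ticket run record: enough to resume and to audit."""
    ticket_id: str
    base_sha: str
    iteration: int = 0
    status: str = "running"                      # running | pr_opened | escalated
    events: list = field(default_factory=list)   # append-only audit trail

    def record(self, tool: str, args: dict, outcome: str) -> None:
        # Tool calls are appended, never rewritten, so the log doubles as an audit trail.
        self.events.append(
            {"iteration": self.iteration, "tool": tool, "args": args, "outcome": outcome}
        )

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "RunState":
        return cls(**json.loads(raw))


state = RunState(ticket_id="PROJ-123", base_sha="abc1234")
state.record("read_file", {"path": "svc/handler.py"}, "ok")

# Persist after each step; a crashed worker can reload and continue from here.
restored = RunState.from_json(state.to_json())
```

Serializing after every tool call is what makes a long run resumable: a replacement worker loads the JSON and continues at `iteration` instead of starting over.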
```hint The loop, concretely
One iteration: gather context (RAG) → propose a plan/edit → apply the edit in the sandbox via tools → run build + tests → read results. If the run fails, reflect and revise. If it passes and the agent is confident, write a PR description and open the PR. Bound the retries; once they are exhausted, escalate to a human with the partial diff and notes.
```
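The loop in the hint above can be sketched as a bounded retry over two callables. This is a sketch under stated assumptions: `propose_patch` and `apply_and_test` are hypothetical stand-ins for the LLM call and the sandboxed test run, and the stubs at the bottom exist only to make the example executable.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class LoopResult:
    outcome: str                  # "open_pr" or "escalate"
    iterations: int
    last_feedback: Optional[str]


def run_loop(
    propose_patch: Callable[[Optional[str]], str],
    apply_and_test: Callable[[str], Tuple[bool, str]],
    max_iterations: int = 3,
) -> LoopResult:
    """Bounded propose -> test -> reflect loop (all names are illustrative)."""
    feedback: Optional[str] = None
    for i in range(1, max_iterations + 1):
        patch = propose_patch(feedback)          # model call, given prior test output
        passed, output = apply_and_test(patch)   # would run inside the sandbox
        if passed:
            return LoopResult("open_pr", i, None)
        feedback = output                        # failing output drives the next attempt
    return LoopResult("escalate", max_iterations, feedback)


# Stub model: the first attempt is wrong, the second one uses the feedback.
def fake_propose(feedback: Optional[str]) -> str:
    return "good-patch" if feedback else "bad-patch"


def fake_test(patch: str) -> Tuple[bool, str]:
    ok = patch == "good-patch"
    return ok, "" if ok else "AssertionError in test_handler"


result = run_loop(fake_propose, fake_test)
gave_up = run_loop(lambda feedback: "bad-patch", fake_test, max_iterations=2)
```

With these stubs, `result` ends in `open_pr` on the second iteration, while `gave_up` exhausts its budget and escalates with the last failure output attached.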
#### Clarifying Questions for this Part
- What is the eligibility gate — labels, a classifier, heuristics on ticket size/risk — and is it a hard filter or a soft prior the agent can override?
- What are the termination conditions: max loop iterations, wall-clock/cost budget, tests-green, or low-confidence abstain?
- How is per-run state persisted so a long run can resume and so we have an audit trail of every action the agent took?
### Part 2 — RAG Over the Codebase
Design the retrieval layer that grounds the agent in the actual repository. Cover what you index, how you chunk and embed it, how you keep it fresh against a moving `main`, and how you retrieve and assemble context for a given ticket so the agent edits the *right* files with the *right* symbols.
```hint Chunk on structure, not on bytes
Naive fixed-size text chunks shred functions and break symbol references. Chunk along **syntactic boundaries** (functions, classes, methods) using a parser/AST (e.g., tree-sitter), and attach metadata: file path, language, symbol name, imports, and surrounding signatures.
```
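To make structural chunking concrete, here is a minimal sketch that chunks a Python file along function, class, and method boundaries. It uses the standard-library `ast` module as a stand-in for a multi-language parser such as tree-sitter; that substitution, the chunk dictionary layout, and the sample file are assumptions made for illustration. It needs Python 3.9+ for `ast.unparse`.

```python
import ast


def chunk_python_source(source: str, path: str) -> list[dict]:
    """Emit one chunk per top-level function, class, and method, with metadata."""
    tree = ast.parse(source)
    lines = source.splitlines()
    # File-level imports travel with every chunk so symbols stay resolvable.
    imports = [
        ast.unparse(n) for n in tree.body if isinstance(n, (ast.Import, ast.ImportFrom))
    ]
    chunks: list[dict] = []

    def add(node: ast.AST, qualified_name: str) -> None:
        chunks.append({
            "path": path,
            "language": "python",
            "symbol": qualified_name,
            "start_line": node.lineno,
            "end_line": node.end_lineno,
            "imports": imports,
            "text": "\n".join(lines[node.lineno - 1:node.end_lineno]),
        })

    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            add(node, node.name)
        elif isinstance(node, ast.ClassDef):
            add(node, node.name)  # whole class, useful for coarse retrieval
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    add(item, f"{node.name}.{item.name}")
    return chunks


SAMPLE = '''import os

def load(path):
    return os.path.exists(path)

class Handler:
    def run(self):
        return load("x")
'''

chunks = chunk_python_source(SAMPLE, "svc/handler.py")
```

In this sketch the class chunk overlaps its method chunks on purpose, giving retrieval both a coarse and a fine granularity. A real indexer might deduplicate, or keep only the class signature in the coarse chunk.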
```hint Retrieval is hybrid + structural, not just vector kNN
Combine **dense embeddings** (semantic similarity to the ticket) with **lexical/symbol search** (exact identifiers, error strings, stack-trace symbols) and **code-graph signals** (call graph, imports, definition/usage). Rerank, then assemble a token-budgeted context. A stack trace or failing test in the ticket is a high-precision entry point — exploit it.
```
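One simple way to combine the dense, lexical, and code-graph rankings is reciprocal rank fusion (RRF), followed by greedy packing into the token budget. RRF is a choice made for this sketch, not something the question prescribes. The chunk ids, token counts, and the constant `k = 60` are illustrative assumptions, and a learned reranker would normally sit between the two steps.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of chunk ids; each list contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def assemble_context(
    ordered_ids: list[str], token_counts: dict[str, int], budget: int
) -> list[str]:
    """Greedily keep the highest-ranked chunks that still fit the token budget."""
    selected: list[str] = []
    used = 0
    for cid in ordered_ids:
        cost = token_counts[cid]
        if used + cost <= budget:
            selected.append(cid)
            used += cost
    return selected


dense = ["utils.retry", "client.send", "client.parse"]  # embedding similarity to the ticket
lexical = ["client.parse", "client.send"]               # exact match on an error string
graph = ["client.parse", "utils.retry"]                 # neighbors of a stack-trace frame

fused = reciprocal_rank_fusion([dense, lexical, graph])
context = assemble_context(
    fused,
    {"client.parse": 400, "client.send": 700, "utils.retry": 300},
    budget=800,
)
```

Here `client.parse` ranks first because two high-precision signals agree on it even though dense similarity alone placed it last. This is the effect the hint describes when it calls a stack trace a high-precision entry point.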
```hint Freshness and grounding
The index must track `main` (incremental re-index on merge, or per-ticket clone at the ticket's base SHA) or the agent will edit stale code. At edit time, prefer the agent re-reading exact file regions via tools so diffs apply to ground-truth bytes rather than to possibly-stale retrieved snippets.
```
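A cheap guard against stale snippets is to store a content hash with each indexed region and compare it with the live file before editing. The sketch below assumes the index keeps a SHA-256 digest plus a 1-indexed inclusive line range per chunk; both the helper names and that storage layout are hypothetical.

```python
import hashlib


def region_digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def is_snippet_fresh(
    indexed_digest: str, current_file_text: str, start_line: int, end_line: int
) -> bool:
    """True if lines start_line..end_line (1-indexed, inclusive) still match the index."""
    lines = current_file_text.splitlines()
    current_region = "\n".join(lines[start_line - 1:end_line])
    return region_digest(current_region) == indexed_digest


indexed_text = "def load(path):\n    return open(path).read()"
digest_at_index_time = region_digest(indexed_text)

unchanged_file = "import os\n" + indexed_text + "\n"
edited_file = "import os\ndef load(path):\n    return Path(path).read_text()\n"

fresh = is_snippet_fresh(digest_at_index_time, unchanged_file, 2, 3)
stale = not is_snippet_fresh(digest_at_index_time, edited_file, 2, 3)
```

On a mismatch the agent would discard the retrieved snippet and re-read the region through `read_file`, so the patch is computed against the bytes actually in the working copy.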
#### Clarifying Questions for this Part
- How fresh must retrieval be — is per-ticket indexing at the base commit acceptable, or do we need a continuously updated shared index?
- Are stack traces, failing test names, or error logs reliably present on tickets? Those are the strongest retrieval anchors.
- What is the token budget for assembled context, and what is the precision target (edit the wrong file → wasted/incorrect PR)?
### Part 3 — MCP Tool Layer (the Agent's Action Interface)
Design the structured tool interface (an MCP-style server/protocol) through which the agent takes all of its actions: reading/searching code, editing files, running builds/tests, calling Jira, and opening the PR. Specify the tool catalog, the schemas, and — critically — the permissioning, sandboxing, and observability of every tool call.
```hint Tools are the trust boundary
The LLM should never get raw shell or raw credentials. It calls **typed, least-privilege tools** with validated arguments; the MCP server holds the credentials and enforces scope. Design tools so that *every* side effect is mediated, logged, and reversible (e.g., edits are diffs against a sandbox working copy, not in-place prod writes).
```
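Argument validation is where "least privilege" becomes concrete. As one example, a file tool can resolve every model-supplied path and refuse anything outside the sandbox working copy. This is a sketch: `resolve_in_workspace`, `ScopeError`, and the workspace path are hypothetical names chosen for illustration.

```python
from pathlib import Path


class ScopeError(Exception):
    """Raised when a tool argument points outside the sandbox working copy."""


def resolve_in_workspace(workspace: Path, requested: str) -> Path:
    """Resolve a model-supplied path and reject anything that escapes the workspace."""
    root = workspace.resolve()
    # resolve() collapses ".." segments and follows symlinks that exist on disk.
    candidate = (root / requested).resolve()
    if candidate != root and root not in candidate.parents:
        raise ScopeError(f"path escapes workspace: {requested}")
    return candidate


workspace = Path("/tmp/agent-run/repo")
inside = resolve_in_workspace(workspace, "src/app.py")

try:
    resolve_in_workspace(workspace, "../../etc/passwd")
    traversal_blocked = False
except ScopeError:
    traversal_blocked = True
```

Comparing against `candidate.parents` rather than using a string prefix check avoids accepting a sibling directory such as `repo-evil` that merely shares the prefix.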
```hint Catalog + guardrails
A reasonable catalog: `search_code`, `read_file(path, range)`, `list_dir`, `apply_patch(diff)`, `run_tests(selector)`, `run_build`, `get_ticket`, `comment_ticket`, `open_pr`. Per tool define: input schema, output schema, scope/permission, rate/cost limits, idempotency, and what is *forbidden* (no `deploy`, no prod DB, no secret reads, no network egress except an allowlist).
```
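The per-tool contract can be sketched as a small registry that validates arguments, enforces a per-run call budget, and logs every call. This is not the MCP SDK: a real MCP server would describe inputs with JSON Schema, and the `ToolSpec` and `ToolRegistry` classes below are hypothetical simplifications that check Python types only.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


class ToolError(Exception):
    pass


@dataclass
class ToolSpec:
    name: str
    required_args: dict          # argument name -> expected Python type
    handler: Callable[..., Any]
    max_calls_per_run: int
    side_effect: str = "none"    # "none" | "sandbox" | "external"


@dataclass
class ToolRegistry:
    tools: dict = field(default_factory=dict)
    calls: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def register(self, spec: ToolSpec) -> None:
        self.tools[spec.name] = spec

    def call(self, name: str, args: dict) -> Any:
        if name not in self.tools:   # anything not in the catalog is forbidden
            raise ToolError(f"unknown tool: {name}")
        spec = self.tools[name]
        if set(args) != set(spec.required_args):
            raise ToolError(f"{name}: expected args {sorted(spec.required_args)}")
        for key, expected in spec.required_args.items():
            if not isinstance(args[key], expected):
                raise ToolError(f"{name}: {key} must be {expected.__name__}")
        if self.calls.get(name, 0) >= spec.max_calls_per_run:
            raise ToolError(f"{name}: call budget exhausted")
        self.calls[name] = self.calls.get(name, 0) + 1
        self.audit_log.append({"tool": name, "args": args})
        return spec.handler(**args)


registry = ToolRegistry()
registry.register(
    ToolSpec(
        name="run_tests",
        required_args={"selector": str},
        handler=lambda selector: f"ran {selector}",  # stub; would dispatch to the sandbox
        max_calls_per_run=2,
        side_effect="sandbox",
    )
)

first = registry.call("run_tests", {"selector": "tests/test_client.py"})

try:
    registry.call("deploy", {})
    deploy_outcome = "allowed"
except ToolError as exc:
    deploy_outcome = str(exc)
```

Because the catalog is deny-by-default, `deploy` never needs an explicit block rule: it fails simply because nobody registered it.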
#### Clarifying Questions for this Part
- Which write-capable tools require human approval vs. run autonomously (e.g., `open_pr` and `comment_ticket` are external side effects)?
- How are credentials injected — does the MCP server hold tokens and the model only sees opaque tool calls, never the secrets?
- What are the rate/cost ceilings per tool to bound runaway loops (e.g., max `run_tests` invocations per run)?
### Part 4 — Sandboxed Execution and Verification
Design the isolated environment where the agent applies edits, builds, and runs tests, and define how test/build results gate whether a PR is opened. Cover isolation, reproducibility, secret handling, resource limits, and the loop that turns red tests into revised edits.
```hint Isolation model
Run each ticket in an **ephemeral, isolated environment** (container/microVM) with: a fresh checkout at the base SHA, no production network access (egress allowlist only), no production secrets, CPU/memory/time limits, and teardown after the run. The agent's `run_tests`/`run_build` execute *inside* this sandbox, never on shared infra.
```
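If the sandbox is a container, much of the isolation model maps onto `docker run` flags. The helper below only builds the argument list and does not launch anything. The image name, paths, and default limits are illustrative assumptions, and a microVM-based sandbox would express the same constraints differently.

```python
def sandbox_command(
    image: str,
    checkout_dir: str,
    test_cmd: list[str],
    cpus: str = "2",
    memory: str = "4g",
    pids: int = 512,
) -> list[str]:
    """Build a `docker run` argv for one ephemeral, network-less test run."""
    return [
        "docker", "run",
        "--rm",                                     # remove the container after the run
        "--network", "none",                        # no egress at all in this sketch
        "--cpus", cpus,
        "--memory", memory,
        "--pids-limit", str(pids),
        "--cap-drop", "ALL",
        "--security-opt", "no-new-privileges",
        "--read-only",                              # read-only root filesystem
        "--tmpfs", "/tmp",                          # scratch space for the build
        "--volume", f"{checkout_dir}:/workspace",   # fresh checkout at the base SHA
        "--workdir", "/workspace",
        image, *test_cmd,
    ]


argv = sandbox_command(
    "ci-image:pinned",
    "/runs/PROJ-123/repo",
    ["pytest", "tests/test_client.py"],
)
```

Two things fall outside these flags and would need handling by the orchestrator: a wall-clock deadline (for example, stopping the container when the run's time budget expires), and builds that need package mirrors, where `--network none` would give way to a network routed through an allowlisting proxy.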
```hint Verification is the gate, not a formality
Define a hard gate: the build must succeed and the relevant tests (plus a regression subset) must pass before `open_pr`. Feed the failing-test output back into the next loop iteration as a Reflexion-style self-critique step. If the run cannot go green after $N$ iterations, abstain and escalate with the partial diff and logs rather than open a broken PR.
```
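The gate itself is a small pure function over the verification results, which makes it easy to test and audit on its own. A sketch, assuming a hypothetical `Verification` record that separates the ticket's target tests from the regression subset:

```python
from dataclasses import dataclass


@dataclass
class Verification:
    build_ok: bool
    target_tests_passed: bool       # tests tied to the ticket, e.g. a reproducing test
    regression_tests_passed: bool   # the regression subset
    iteration: int                  # 1-based index of the attempt just completed


def gate(v: Verification, max_iterations: int) -> str:
    """Return the next action: 'open_pr', 'retry', or 'escalate'."""
    if v.build_ok and v.target_tests_passed and v.regression_tests_passed:
        return "open_pr"
    if v.iteration < max_iterations:
        return "retry"
    return "escalate"


green = gate(Verification(True, True, True, iteration=1), max_iterations=3)
red_with_budget = gate(Verification(True, False, True, iteration=1), max_iterations=3)
red_out_of_budget = gate(Verification(True, True, False, iteration=3), max_iterations=3)
```

The third case shows why the regression subset sits in the conjunction: a patch that fixes the ticket's own test but breaks a neighboring one still escalates instead of opening a PR.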
#### Clarifying Questions for this Part
- Is there a deterministic, reproducible build/test setup we can containerize, and how long do full test runs take (affects budget and whether we run targeted subsets)?
- What network egress, if any, does the build legitimately need (package mirrors), and how do we allowlist it without enabling exfiltration?
- What is the maximum iteration/cost budget before the run abstains and escalates to a human?
Quick Answer: This ML system design question evaluates the ability to architect an autonomous LLM-agent pipeline spanning retrieval-augmented generation, tool integration, and sandboxed code execution. It tests conceptual understanding of how to ground agent reasoning in real codebases, structure permissioned tool access, and safely isolate execution environments — core competencies for senior roles building AI-powered developer tooling.