What difficulty level is this interview question?

This is a hard difficulty ML System Design question, commonly asked during Onsite rounds at Harvey.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Harvey during technical interviews.

Design a RAG Q&A Agent over Law Firm Legal Memos

This question evaluates a candidate's ability to design a retrieval-augmented generation system, covering ingestion, chunking, indexing, retrieval, and grounded answer generation. It tests judgment on balancing retrieval quality against system scale, a common ML system design theme for assessing practical, production-level architecture skills.

Design a retrieval-augmented generation (RAG) AI agent that answers legal questions by grounding its responses in published memos and client alerts from large law firms.

Big law firms regularly publish memos, client alerts, and legal updates on their public websites (for example, summaries of a new regulation, a court ruling, or guidance on a deal structure). A user — typically a lawyer or an in-house counsel — asks a natural-language question such as "What did firms say about the new SEC climate disclosure rule and which compliance deadlines did they flag?" The system must crawl and ingest these memos, retrieve the passages most relevant to the question, and produce a grounded, citation-backed answer that links every claim back to the source memo. Answers that are not supported by an indexed memo should be refused rather than hallucinated.

You own the system end to end: the crawling/ingestion pipeline, the chunking and embedding/index design, the retrieval layer, the answer-generation layer (LLM prompting, grounding, and citations), and the offline/online evaluation and monitoring. A recurring point of confusion is whether the hard part of this problem is retrieval quality (indexing/recall) or system scale. A strong design treats both as first-class and is explicit about where the engineering effort actually goes.

Constraints & Assumptions

Corpus: ~500 large law firms, each publishing 5–50 memos/week. Assume a steady state of roughly 1–5 million memos, growing by roughly 2,500–25,000 memos per week (500 firms × 5–50 memos each). Memos are mostly HTML pages and PDFs, 500–5,000 words each, in English.
Freshness: Newly published memos should be answerable within a few hours of publication; legal questions are often time-sensitive.
Query load: Start at ~10 queries/second peak from professional users; design so it can grow ~10x without a re-architecture.
Latency: An end-to-end answer latency of a few seconds (p95 ≤ ~5 s) is acceptable for a "research assistant" UX; streaming the answer token-by-token is allowed.
Correctness bar is high: This is a legal/professional context. A confidently wrong, uncited, or fabricated-citation answer is worse than "I couldn't find a relevant memo." Every factual claim in an answer must be attributable to a retrieved passage.
You may use a hosted LLM and a hosted embedding model; assume per-token API cost matters at scale.
Assume you only ingest publicly published memos (no paywalled or privileged content), and you respect each site's robots.txt and terms.
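
The constraints above can be turned into a rough sizing estimate. The sketch below is only a back-of-envelope calculation: the firm count, memo rates, corpus size, and memo lengths come from the constraints, while the chunk size (300 words), embedding dimension (1024), and float32 storage are illustrative assumptions that the question does not specify.

```python
# Back-of-envelope sizing from the stated constraints.
# ASSUMPTIONS (not from the question): ~300 words per chunk,
# 1024-dim float32 embeddings, no index overhead or compression.

FIRMS = 500
MEMOS_PER_FIRM_PER_WEEK = (5, 50)      # stated range
CORPUS_MEMOS = (1_000_000, 5_000_000)  # stated steady state
WORDS_PER_MEMO = (500, 5_000)          # stated range

WORDS_PER_CHUNK = 300  # assumption
EMBED_DIM = 1024       # assumption
BYTES_PER_FLOAT = 4    # float32


def weekly_ingest():
    """(low, high) new memos per week across all firms."""
    lo, hi = MEMOS_PER_FIRM_PER_WEEK
    return FIRMS * lo, FIRMS * hi


def chunks_per_memo(words):
    """Ceiling division: chunks needed to cover a memo of `words` words."""
    return -(-words // WORDS_PER_CHUNK)


def raw_vector_gib(num_memos, words_per_memo):
    """Raw embedding storage in GiB, before any index overhead."""
    n_chunks = num_memos * chunks_per_memo(words_per_memo)
    return n_chunks * EMBED_DIM * BYTES_PER_FLOAT / 2**30


if __name__ == "__main__":
    print("new memos/week:", weekly_ingest())
    print("smallest corpus GiB:",
          round(raw_vector_gib(CORPUS_MEMOS[0], WORDS_PER_MEMO[0]), 1))
    print("largest corpus GiB:",
          round(raw_vector_gib(CORPUS_MEMOS[1], WORDS_PER_MEMO[1]), 1))
```

Under these assumptions the raw vectors range from single-digit GiB to a few hundred GiB, which suggests the index fits on a small number of machines and that the harder problem is likely retrieval quality rather than raw storage.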

Clarifying Questions to Ask

Users and scope: Who is the user (practicing attorney, in-house counsel, paralegal) and what is the primary job-to-be-done: quick lookup, comparative research across firms, or drafting support? Is this single-turn Q&A or a multi-turn conversational agent?
Source of truth: Is the answer strictly limited to ingested firm memos, or can the LLM also use its own parametric knowledge? (This drives how aggressively we must refuse and cite.)
Coverage vs. precision: When no relevant memo exists, do we prefer to abstain, or return a low-confidence general answer with a clear disclaimer?
Freshness vs. cost: How fresh must answers be: minutes, hours, or daily? This sets the crawl cadence and re-index strategy.
Attribution requirements: Do answers need clickable citations to specific passages, firm/date metadata, and "as of" disclaimers? Is jurisdiction filtering required?
Evaluation availability: Do we have access to legal experts to label answer quality, or must we bootstrap evaluation with weak/synthetic labels first?

Part 1 — Crawling and document ingestion

Design the pipeline that discovers, fetches, parses, and normalizes law-firm memos into clean, structured documents ready for indexing. Address how you find new memos across ~500 heterogeneous firm websites, how you keep the corpus fresh without re-crawling everything, and how you handle HTML vs. PDF, deduplication, and document metadata (firm, authors, publish date, practice area, jurisdiction).
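
One way to avoid re-indexing everything on each crawl is to hash the normalized text of each fetched memo and compare it with what was seen before. The sketch below is a minimal illustration of that idea, not a prescribed design: `IngestState`, `MemoRecord`, and the four status labels are hypothetical names, and the in-memory dictionaries stand in for whatever persistent crawl-state store a real pipeline would use.

```python
import hashlib
import re
from dataclasses import dataclass
from typing import Dict, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial markup changes keep the same hash."""
    return re.sub(r"\s+", " ", text).strip().lower()


def content_hash(text: str) -> str:
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


@dataclass
class MemoRecord:
    url: str
    firm: str
    content_sha256: str


class IngestState:
    """In-memory stand-in for a persistent crawl-state store (hypothetical)."""

    def __init__(self) -> None:
        self.by_url: Dict[str, MemoRecord] = {}
        self.by_hash: Dict[str, str] = {}  # content hash -> first URL seen

    def classify(self, url: str, firm: str, text: str) -> str:
        """Return 'new', 'unchanged', 'updated', or 'duplicate'."""
        h = content_hash(text)
        prev: Optional[MemoRecord] = self.by_url.get(url)
        if prev is not None:
            if prev.content_sha256 == h:
                return "unchanged"          # skip re-chunking and re-embedding
            # Known URL with new content: the memo was edited after publication.
            if self.by_hash.get(prev.content_sha256) == url:
                del self.by_hash[prev.content_sha256]
            prev.content_sha256 = h
            self.by_hash.setdefault(h, url)
            return "updated"                # re-index and retire old chunks
        if h in self.by_hash:
            return "duplicate"              # same text under another URL
        self.by_url[url] = MemoRecord(url, firm, h)
        self.by_hash[h] = url
        return "new"
```

Exact hashing only catches copies that are identical after normalization, such as the same memo served as both HTML and PDF with matching extracted text. Near-duplicates (for example, a memo republished with a different footer) would need a similarity technique such as shingling with MinHash, which this sketch does not attempt.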

Part 2 — Chunking, embeddings, and the retrieval index

Design how documents become retrievable units and how the system retrieves the right passages for a query. Cover the chunking strategy, the embedding model choice, the vector index, and how you maximize recall and precision. The poster's core question lives here: how do you optimize indexing/recall so the LLM is actually given the right evidence?

Clarifying Questions for this Part

Are queries usually about a single topic/event (favoring precision) or comparative across firms (favoring recall + grouping by firm)?
Is recency a hard filter (only memos after date X) or a soft ranking signal?
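
As one possible starting point for this part, the sketch below shows overlapping word-window chunking and reciprocal rank fusion (RRF) for merging a keyword ranking with a vector ranking. Both are common techniques chosen here for illustration, not methods the question prescribes. The window size, overlap, and the RRF constant `k = 60` are assumed defaults, and the ranked lists are expected to come from retrievers that are not shown.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def chunk_words(text: str, size: int = 300, overlap: int = 50) -> List[str]:
    """Split text into word windows of `size` words that overlap by `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of chunk IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

A usage example would pass the top results from a keyword retriever and a vector retriever as two lists, then send the fused top results to a re-ranker. Fixed word windows ignore memo structure; splitting on headings or paragraphs first is a natural refinement when the parsed HTML preserves them.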

Part 3 — Grounded answer generation and citations

Design the generation layer: how retrieved passages are assembled into a prompt, how the LLM is constrained to answer only from provided evidence, how citations are attached, and how you minimize hallucination and fabricated citations.
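
A minimal sketch of the prompt assembly and a post-generation citation check might look like the following. The `Passage` fields, the prompt wording, the bracketed `[n]` citation format, and the helper names are all assumptions made for illustration; the LLM call itself is omitted because it depends on the hosted provider.

```python
import re
from dataclasses import dataclass
from typing import List, Set

REFUSAL = "I couldn't find a relevant memo."
CITATION = re.compile(r"\[(\d+)\]")


@dataclass(frozen=True)
class Passage:
    firm: str
    url: str
    published: str  # ISO date string, e.g. "2024-03-01"
    text: str


def build_prompt(question: str, passages: List[Passage]) -> str:
    """Number each passage so the model can cite it as [n]."""
    lines = [
        "Answer using ONLY the numbered passages below.",
        "End every factual sentence with bracketed passage numbers, e.g. [1].",
        f'If the passages do not answer the question, reply exactly: "{REFUSAL}"',
        "",
    ]
    for i, p in enumerate(passages, start=1):
        lines.append(f"[{i}] {p.firm}, {p.published}, {p.url}")
        lines.append(p.text)
        lines.append("")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


def cited_ids(answer: str) -> Set[int]:
    return {int(m) for m in CITATION.findall(answer)}


def check_citations(answer: str, num_passages: int) -> List[str]:
    """Structural check only; an empty list means no problems were found."""
    if answer.strip() == REFUSAL:
        return []
    problems: List[str] = []
    bad = sorted(i for i in cited_ids(answer) if not 1 <= i <= num_passages)
    if bad:
        problems.append(f"cites unknown passages: {bad}")
    # Naive sentence split; assumes citations appear before the final period.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    uncited = [s for s in sentences if not CITATION.search(s)]
    if uncited:
        problems.append(f"{len(uncited)} sentence(s) without a citation")
    return problems
```

This check catches fabricated passage numbers and uncited sentences, but it cannot tell whether a cited passage actually supports the claim. Verifying support would need a separate step, such as an entailment model or a second LLM pass, which is outside this sketch.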

Part 4 — Evaluation, monitoring, and scaling

Define how you measure whether the system is good and how you operate it. Cover offline retrieval and answer-quality evaluation, online metrics, and the scaling/cost story for both ingestion and query serving (this is the "scale up" half of the poster's question).
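
For the offline retrieval half of the evaluation, two standard metrics are recall@k and mean reciprocal rank (MRR). The sketch below computes both over a labeled query set. The data shapes (`runs` mapping a query ID to ranked chunk IDs, `gold` mapping a query ID to the set of relevant chunk IDs) are assumptions for illustration, and the labels themselves would come from legal experts or a bootstrapped synthetic set.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        raise ValueError("relevant set must be non-empty")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: Dict[str, List[str]],
             gold: Dict[str, Set[str]],
             k: int = 10) -> Dict[str, float]:
    """Average recall@k and MRR over every query that has gold labels."""
    if not gold:
        raise ValueError("gold must contain at least one labeled query")
    recalls = [recall_at_k(runs.get(q, []), rel, k) for q, rel in gold.items()]
    rrs = [reciprocal_rank(runs.get(q, []), rel) for q, rel in gold.items()]
    n = len(gold)
    return {f"recall@{k}": sum(recalls) / n, "mrr": sum(rrs) / n}
```

Retrieval metrics like these say nothing about whether the generated answer is faithful to the passages, so answer-quality evaluation (groundedness, citation accuracy, refusal behavior) needs to be tracked separately.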

Follow-up Questions

A memo is updated or retracted after publication (e.g. a firm corrects guidance). How do you detect this, propagate it through the index, and avoid citing stale or withdrawn guidance?
Your retrieval recall@10 is high but users still report wrong answers. How do you localize the failure — retrieval, re-ranking, prompt, or the generator — and what experiment would you run?
How would you extend this from single-turn Q&A to a multi-turn research agent that can do follow-up retrieval (agentic / iterative retrieval), and what new failure modes does that introduce?
How would you support comparative queries ("how do firms differ on X?") that require aggregating and contrasting evidence across many firms rather than answering from a single passage?
