PracHub

Design a RAG Q&A Agent over Law Firm Legal Memos

Last updated: Jul 1, 2026

Quick Overview

This question evaluates a candidate's ability to design a retrieval-augmented generation system, covering ingestion, chunking, indexing, retrieval, and grounded answer generation. It tests judgment on balancing retrieval quality against system scale, a common ML system design theme for assessing practical, production-level architecture skills.

  • hard
  • Harvey
  • ML System Design
  • Software Engineer

Company: Harvey

Role: Software Engineer

Category: ML System Design

Difficulty: hard

Interview Round: Onsite

Hints

Part 1 hints

  • Where to start: Separate discovery (which URLs exist or changed) from fetching (download the bytes) from parsing (HTML/PDF → clean text + metadata). Use per-site adapters, sitemaps, and RSS for discovery, a politeness-aware fetch queue, and a content extractor (boilerplate removal) for parsing.
  • Freshness without full re-crawl: Use content hashing plus HTTP ETag/Last-Modified and per-firm publish feeds so you only re-process changed or new pages. Treat ingestion as an incremental, idempotent upsert keyed by a stable document id (canonical URL + content hash).

Part 2 hints

  • Chunking: Chunk on semantic boundaries (headings/paragraphs), not fixed byte windows; keep chunks ~200–500 tokens with overlap, and attach document metadata (firm, date, jurisdiction) to every chunk so you can filter and cite.
  • Recall and precision together: Use hybrid retrieval. Combine dense (embedding ANN) with sparse lexical (BM25) retrieval to catch exact legal terms and citations, then re-rank the union with a cross-encoder. Over-retrieve (e.g. top-50), then re-rank to top-k to trade a little latency for much better precision.
  • Metadata is part of retrieval: Most legal queries are time- and jurisdiction-scoped. Push metadata filters (date range, practice area) into the index so "recent" and "relevant" aren't fighting each other.

Part 3 hints

  • Grounding contract: Pass numbered passages to the LLM and require it to cite passage ids inline; instruct it to answer "not found in the available memos" when evidence is insufficient. Then verify citations post-hoc: drop or flag any claim whose cited passage doesn't actually support it.
  • Context budget: You can't stuff 50 chunks into the prompt. Use the re-ranked top-k, optionally compress or summarize passages, and prefer fewer high-precision chunks over many noisy ones.

Part 4 hints

  • Evaluate the two stages separately: Retrieval and generation fail differently. Measure retrieval (recall@k, MRR/nDCG against a labeled query→relevant-memo set) independently from answer quality (faithfulness/groundedness, correctness, citation accuracy). A bad answer on good retrieval is a generation bug; a bad answer on bad retrieval is an index/recall bug.

Jun 18, 2026, 12:00 AM
Design a retrieval-augmented generation (RAG) AI agent that answers legal questions by grounding its responses in published memos and client alerts from large law firms.

Big law firms regularly publish memos, client alerts, and legal updates on their public websites (for example, summaries of a new regulation, a court ruling, or guidance on a deal structure). A user, typically a lawyer or in-house counsel, asks a natural-language question such as "What did firms say about the new SEC climate disclosure rule and which compliance deadlines did they flag?" The system must crawl and ingest these memos, retrieve the passages most relevant to the question, and produce a grounded, citation-backed answer that links every claim back to the source memo. When no indexed memo supports an answer, the system should refuse rather than hallucinate.

You own the system end to end: the crawling/ingestion pipeline, the chunking and embedding/index design, the retrieval layer, the answer-generation layer (LLM prompting, grounding, and citations), and the offline/online evaluation and monitoring. A recurring point of confusion is whether the hard part of this problem is retrieval quality (indexing/recall) or system scale. A strong design treats both as first-class and is explicit about where the engineering effort actually goes.

Constraints & Assumptions

  • Corpus: ~500 large law firms, each publishing 5–50 memos/week. Assume a steady state of roughly 1–5 million memos, growing by tens of thousands per week. Memos are mostly HTML pages and PDFs, 500–5,000 words each, in English.
  • Freshness: Newly published memos should be answerable within a few hours of publication; legal questions are often time-sensitive.
  • Query load: Start at ~10 queries/second peak from professional users; design so it can grow ~10x without a re-architecture.
  • Latency: End-to-end answer latency target of a few seconds (p95 ≤ ~5 s) is acceptable for a "research assistant" UX; streaming the answer token-by-token is allowed.
  • Correctness bar is high: This is a legal/professional context. A confidently wrong, uncited, or fabricated-citation answer is worse than "I couldn't find a relevant memo." Every factual claim in an answer must be attributable to a retrieved passage.
  • You may use a hosted LLM and a hosted embedding model; assume per-token API cost matters at scale.
  • Assume you only ingest publicly published memos (no paywalled or privileged content), and you respect each site's robots.txt and terms.
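
The corpus figures above translate into index size with some simple arithmetic. The sketch below is a back-of-envelope estimate only: the firm count, memo rate, corpus size, and word range come from the constraints, while the average memo length, tokens-per-word ratio, chunk size, and embedding dimensionality are assumptions chosen for illustration.

```python
import math

# Stated in the constraints
FIRMS = 500
MEMOS_PER_FIRM_PER_WEEK = (5, 50)
MEMOS = 5_000_000            # upper end of the stated 1-5 million range

# Assumptions (not from the problem statement)
WORDS_PER_MEMO = 2_000       # assumed average within the stated 500-5,000 range
TOKENS_PER_WORD = 1.3        # assumed rule of thumb for English text
CHUNK_TOKENS = 400           # assumed chunk size, no overlap counted
EMBED_DIM = 1024             # assumed embedding dimensionality
BYTES_PER_FLOAT = 4          # float32 vectors, no compression


def estimate(memos: int = MEMOS) -> dict:
    """Rough chunk count and raw vector storage for the whole corpus."""
    tokens_per_memo = WORDS_PER_MEMO * TOKENS_PER_WORD
    chunks_per_memo = math.ceil(tokens_per_memo / CHUNK_TOKENS)
    total_chunks = memos * chunks_per_memo
    vector_bytes = total_chunks * EMBED_DIM * BYTES_PER_FLOAT
    low, high = MEMOS_PER_FIRM_PER_WEEK
    return {
        "weekly_memos": (FIRMS * low, FIRMS * high),
        "chunks_per_memo": chunks_per_memo,
        "total_chunks": total_chunks,
        "vector_gib": vector_bytes / 2**30,
    }


if __name__ == "__main__":
    for key, value in estimate().items():
        print(key, value)
```

Under these assumptions the corpus is tens of millions of chunks and on the order of a hundred GiB of raw float32 vectors, which suggests the index is large enough to need sharding or quantization but not exotic infrastructure.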

Clarifying Questions to Ask

  • Users and scope: Who is the user (practicing attorney, in-house counsel, paralegal) and what is the primary job-to-be-done: quick lookup, comparative research across firms, or drafting support? Is this single-turn Q&A or a multi-turn conversational agent?
  • Source of truth: Is the answer strictly limited to ingested firm memos, or can the LLM also use its own parametric knowledge? (This drives how aggressively we must refuse and cite.)
  • Coverage vs. precision: When no relevant memo exists, do we prefer to abstain, or return a low-confidence general answer with a clear disclaimer?
  • Freshness vs. cost: How fresh must answers be: minutes, hours, or daily? This sets the crawl cadence and re-index strategy.
  • Attribution requirements: Do answers need clickable citations to specific passages, firm/date metadata, and "as of" disclaimers? Is jurisdiction filtering required?
  • Evaluation availability: Do we have access to legal experts to label answer quality, or must we bootstrap evaluation with weak/synthetic labels first?

Part 1 — Crawling and document ingestion

Design the pipeline that discovers, fetches, parses, and normalizes law-firm memos into clean, structured documents ready for indexing. Address how you find new memos across ~500 heterogeneous firm websites, how you keep the corpus fresh without re-crawling everything, and how you handle HTML vs. PDF, deduplication, and document metadata (firm, authors, publish date, practice area, jurisdiction).
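
One way to keep the corpus fresh without re-processing everything is to make ingestion an idempotent upsert that skips unchanged content. The following is a minimal sketch, assuming an in-memory dict stands in for the document store; `StoredDoc`, `content_hash`, and `upsert` are hypothetical names, and keying on the canonical URL while using the hash only for change detection is one possible choice among several.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class StoredDoc:
    url: str
    content_hash: str
    text: str
    version: int = 1


def content_hash(text: str) -> str:
    """Hash of whitespace-normalized, lowercased text so cosmetic changes are ignored."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def upsert(store: dict, canonical_url: str, text: str) -> str:
    """Insert or update a memo; return what happened so callers know whether to re-index."""
    digest = content_hash(text)
    existing = store.get(canonical_url)
    if existing is None:
        store[canonical_url] = StoredDoc(canonical_url, digest, text)
        return "inserted"
    if existing.content_hash == digest:
        return "unchanged"  # safe to skip chunking and embedding
    store[canonical_url] = StoredDoc(canonical_url, digest, text, existing.version + 1)
    return "updated"
```

Only the "inserted" and "updated" outcomes need to flow on to chunking and embedding, which is where most of the per-document cost sits. Running the same crawl twice produces the same store, so retries after a failed batch are harmless.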

Part 2 — Chunking, embeddings, and the retrieval index

Design how documents become retrievable units and how the system retrieves the right passages for a query. Cover the chunking strategy, the embedding model choice, the vector index, and how you maximize recall and precision. The poster's core question lives here: how do you optimize indexing/recall so the LLM is actually given the right evidence?
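
A chunking strategy can be sketched as packing whole paragraphs into chunks up to a size budget, carrying the last paragraph forward as overlap and copying document metadata onto every chunk. This is an illustrative sketch only: it assumes paragraphs are separated by blank lines, uses word counts as a stand-in for token counts, and the `Chunk` and `chunk_memo` names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    doc_id: str
    index: int
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_memo(doc_id, text, metadata, max_words=300, overlap_paragraphs=1):
    """Pack whole paragraphs into chunks of at most ~max_words, with paragraph overlap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []

    def emit():
        chunks.append(Chunk(doc_id, len(chunks), "\n\n".join(current), dict(metadata)))

    for para in paragraphs:
        size = sum(len(p.split()) for p in current) + len(para.split())
        if current and size > max_words:
            emit()
            # Carry the tail of the previous chunk forward as overlap.
            current = current[-overlap_paragraphs:] if overlap_paragraphs else []
        current.append(para)
    if current:
        emit()
    return chunks
```

A single paragraph longer than the budget becomes its own oversized chunk here; a production splitter would likely fall back to sentence boundaries in that case. Attaching metadata at chunk level is what later allows filtering by firm or date and building citations without a second lookup.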

Clarifying Questions for this Part

  • Are queries usually about a single topic/event (favoring precision) or comparative across firms (favoring recall + grouping by firm)?
  • Is recency a hard filter (only memos after date X) or a soft ranking signal?
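
The recency question above changes the ranking code in a concrete way. The sketch below merges a dense and a sparse ranking with reciprocal rank fusion (a common fusion method, used here as an assumption rather than something the question prescribes) and then applies recency either as a hard cutoff or as an exponential decay. The function names, the `k=60` constant, and the half-life parameter are illustrative choices.

```python
from datetime import date


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per id."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return scores


def rank_with_recency(rankings, published, today, not_before=None, half_life_days=None):
    """Fuse rankings, then apply recency as a hard filter and/or a soft decay."""
    results = []
    for chunk_id, score in reciprocal_rank_fusion(rankings).items():
        pub = published[chunk_id]
        if not_before is not None and pub < not_before:
            continue  # hard filter: older memos are excluded outright
        if half_life_days is not None:
            age_days = (today - pub).days
            score *= 0.5 ** (age_days / half_life_days)  # soft signal: halve per half-life
        results.append((chunk_id, score))
    # Sort by score descending, breaking ties by id for determinism.
    return [cid for cid, _ in sorted(results, key=lambda item: (-item[1], item[0]))]


if __name__ == "__main__":
    dense = ["a", "b", "c"]    # e.g. from an ANN index
    sparse = ["b", "d", "a"]   # e.g. from BM25
    published = {"a": date(2020, 1, 1), "b": date(2026, 6, 1),
                 "c": date(2026, 6, 1), "d": date(2026, 6, 1)}
    print(rank_with_recency([dense, sparse], published, date(2026, 7, 1)))
```

With a hard filter an old but authoritative memo can never surface; with a soft decay it can still win if it matches far better than anything recent. In a real system the hard filter would usually be pushed into the index query instead of applied after fusion.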

Part 3 — Grounded answer generation and citations

Design the generation layer: how retrieved passages are assembled into a prompt, how the LLM is constrained to answer only from provided evidence, how citations are attached, and how you minimize hallucination and fabricated citations.
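
The prompt assembly and a first-pass citation check might look like the sketch below. It is a hedged illustration: the prompt wording, the `REFUSAL` string, and the function names are hypothetical, and the check is purely structural. It catches uncited sentences and citation ids that point at no passage, but confirming that a cited passage actually supports the claim would need a separate entailment or judge step that is not shown.

```python
import re

REFUSAL = "Not found in the available memos."
CITATION = re.compile(r"\[(\d+)\]")


def build_prompt(question, passages):
    """Number the retrieved passages and instruct the model to cite them inline."""
    numbered = "\n\n".join(
        f"[{i}] ({p['firm']}, {p['date']}) {p['text']}"
        for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the numbered passages below.\n"
        "Cite the supporting passage id in square brackets inside every sentence, "
        "before its final punctuation.\n"
        f"If the passages do not contain the answer, reply exactly: {REFUSAL}\n\n"
        f"Passages:\n{numbered}\n\nQuestion: {question}\n"
    )


def check_citations(answer, num_passages):
    """Return (uncited_sentences, invalid_ids) for a generated answer."""
    if answer.strip() == REFUSAL:
        return [], []
    # Naive sentence split; abbreviations such as "e.g." would need a real tokenizer.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    uncited, invalid = [], set()
    for sentence in sentences:
        ids = [int(m) for m in CITATION.findall(sentence)]
        if not ids:
            uncited.append(sentence)
        invalid.update(i for i in ids if not 1 <= i <= num_passages)
    return uncited, sorted(invalid)
```

An answer with any uncited sentence or invalid id can be regenerated, trimmed, or replaced with the refusal, depending on how strict the product needs to be.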

Part 4 — Evaluation, monitoring, and scaling

Define how you measure whether the system is good and how you operate it. Cover offline retrieval and answer-quality evaluation, online metrics, and the scaling/cost story for both ingestion and query serving (this is the "scale up" half of the poster's question).
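
For the offline retrieval side, two standard metrics can be computed from a labeled set of queries and their relevant memos. The sketch below assumes each evaluation run is a pair of (ranked retrieved ids, set of relevant ids); the function names and data layout are illustrative.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k=10):
    """Average recall@k and MRR over (retrieved, relevant) pairs."""
    if not runs:
        return {"recall_at_k": 0.0, "mrr": 0.0}
    recall = sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)
    mrr = sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)
    return {"recall_at_k": recall, "mrr": mrr}


if __name__ == "__main__":
    runs = [
        (["m1", "m2", "m3"], {"m2", "m9"}),
        (["m4", "m5"], {"m4"}),
    ]
    print(evaluate(runs, k=2))
```

Tracking these numbers separately from answer-quality scores makes it possible to tell whether a regression came from the index or from the generation layer.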

Follow-up Questions

  • A memo is updated or retracted after publication (e.g. a firm corrects guidance). How do you detect this, propagate it through the index, and avoid citing stale or withdrawn guidance?
  • Your retrieval recall@10 is high but users still report wrong answers. How do you localize the failure — retrieval, re-ranking, prompt, or the generator — and what experiment would you run?
  • How would you extend this from single-turn Q&A to a multi-turn research agent that can do follow-up retrieval (agentic / iterative retrieval), and what new failure modes does that introduce?
  • How would you support comparative queries ("how do firms differ on X?") that require aggregating and contrasting evidence across many firms rather than answering from a single passage?
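
For the failure-localization follow-up, one possible approach is to log the ids seen at each stage and attribute each wrong answer to the first stage that lost the gold passage. The sketch below assumes a labeled example provides gold passage ids, and that the pipeline logs the candidate set before re-ranking and the passages actually placed in the prompt; the function name and stage labels are hypothetical.

```python
from collections import Counter


def localize_failure(gold_ids, candidate_ids, prompt_ids, answer_correct):
    """Attribute a wrong answer to the first stage that dropped all gold passages."""
    if answer_correct:
        return "ok"
    gold = set(gold_ids)
    if not gold & set(candidate_ids):
        return "retrieval"    # gold never reached the candidate pool
    if not gold & set(prompt_ids):
        return "re-ranking"   # gold was retrieved but cut before the prompt
    return "generation"       # gold was in the prompt; prompt or model is at fault


def failure_breakdown(examples):
    """Count failures per stage over (gold, candidates, prompt, correct) tuples."""
    return Counter(localize_failure(*example) for example in examples)
```

A breakdown dominated by "generation" despite high recall@10 points toward experiments on the prompt or model (for example, holding retrieval fixed and varying only the prompt), while a large "re-ranking" share suggests the top-k cut is too aggressive.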
