Retrieval-Augmented Generation Systems
Asked of: Software Engineer

What's being tested
Interviewers are probing whether you can design a retrieval-augmented generation system as a production backend, not just describe “vector search plus an LLM.” For Harvey, the hard parts are document-scale ingestion, permission-safe retrieval, low-latency serving, citation fidelity, and graceful behavior when model outputs are imperfect. A strong Software Engineer answer should decompose the system into APIs, storage, indexing, retrieval, orchestration, observability, and failure handling. Expect follow-ups on scale, consistency, access control, latency, and how you would debug bad answers.
Core knowledge
-
RAG architecture usually has two paths: an offline indexing pipeline and an online query-serving path. Indexing parses documents, chunks text, computes embeddings, stores metadata, and builds search indexes; serving embeds the query, retrieves candidates, constructs context, calls the LLM, and returns citations.
-
Chunking strategy strongly affects retrieval quality and system behavior. Typical chunks are 300–1,000 tokens with 10–20% overlap; smaller chunks improve pinpoint citations but lose context, while larger chunks reduce fragmentation but waste prompt budget and can bury relevant facts.
-
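Embedding retrieval maps chunks and queries into vectors and ranks by cosine similarity:
$$\text{sim}(q, c) = \frac{q \cdot c}{\lVert q \rVert \, \lVert c \rVert}$$
where q is the query embedding and c is a chunk embedding. Use approximate nearest neighbor indexes such as HNSW in pgvector, Pinecone, Weaviate, Milvus, or OpenSearch k-NN when exact scan becomes too slow.
-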
Hybrid search combines dense vector retrieval with lexical search such as BM25. This matters for legal and enterprise documents because exact terms, party names, clause numbers, citations, and defined terms may be missed by pure embeddings.
-
Metadata filtering is not optional. Store tenant ID, matter ID, document ID, source, version, page ranges, timestamps, and ACL information alongside each chunk. For confidential legal data, enforce permission filters before or during retrieval, not after generation.
-
Reranking improves top-k quality by applying a more expensive model or scoring function to a smaller candidate set. A common pattern is to retrieve top 100 via vector/BM25, rerank to top 10–20, then pack only the best chunks into the prompt.
-
Context assembly is a backend algorithm. Deduplicate overlapping chunks, preserve source ordering where helpful, avoid including contradictory stale versions, and fit within the model’s context window. Use citation IDs like [doc_id:page:chunk_id] so the final answer can be traced back.
-
Latency budgeting should be explicit. A typical online path may include query embedding, vector search, lexical search, reranking, prompt construction, LLM first-token latency, and streaming. Optimize for p95/p99, not average latency; slow rerankers or cold LLM calls dominate tail latency.
-
Caching can help at multiple layers: parsed documents, embeddings, retrieval results for repeated queries, and LLM responses for deterministic prompts. Be careful caching user-visible answers when permissions, document versions, or prompt templates change.
-
Versioning and consistency are central in document systems. If a user uploads a new contract version, decide whether queries require read-after-write indexing or can tolerate eventual consistency. Tag every chunk and citation with a document version ID so citations never point to superseded content.
-
Observability should expose retrieval and generation internals: retrieved chunk IDs, similarity scores, applied filters, prompt token counts, model latency, error rates, empty retrieval rates, and user feedback. Without retrieval traces, “the answer was bad” is nearly impossible to debug.
-
Failure modes need designed responses. If retrieval returns no evidence, the system should say it cannot find support rather than hallucinate. If an LLM call times out, return a retryable error, partial streamed response, or fallback result depending on product requirements and correctness risk.
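The context assembly and failure-mode bullets above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: `Chunk`, `pack_context`, `build_prompt`, and `NO_EVIDENCE` are hypothetical names, token counts are assumed to be precomputed by whatever tokenizer the model uses, and deduplication here is by exact text match only.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    page: int
    chunk_id: str
    text: str
    score: float   # retrieval or reranker score, higher is better
    n_tokens: int  # assumed to be precomputed with the model's tokenizer


# Hypothetical refusal message returned when no evidence was retrieved.
NO_EVIDENCE = "I could not find support for this in the available documents."


def pack_context(chunks: list[Chunk], token_budget: int) -> list[Chunk]:
    """Greedily keep the best-scoring chunks that fit the budget, skipping exact duplicates."""
    seen_text: set[str] = set()
    packed: list[Chunk] = []
    used = 0
    for c in sorted(chunks, key=lambda c: c.score, reverse=True):
        if c.text in seen_text or used + c.n_tokens > token_budget:
            continue
        seen_text.add(c.text)
        packed.append(c)
        used += c.n_tokens
    # Restore source order so the model reads passages in document order.
    return sorted(packed, key=lambda c: (c.doc_id, c.page, c.chunk_id))


def build_prompt(question: str, chunks: list[Chunk], token_budget: int) -> Optional[str]:
    """Return a prompt with citation IDs, or None when there is no evidence to cite."""
    packed = pack_context(chunks, token_budget)
    if not packed:
        return None  # caller should reply with NO_EVIDENCE instead of calling the LLM
    sources = "\n".join(f"[{c.doc_id}:{c.page}:{c.chunk_id}] {c.text}" for c in packed)
    return (
        "Answer using only the sources below and cite their IDs.\n"
        f"{sources}\n\nQuestion: {question}"
    )
```

Returning `None` rather than an empty prompt forces the caller to handle the no-evidence case explicitly, which is one way to keep the model from answering without support.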
Worked example
For the prompt “Design a Retrieval-Augmented Generation System,” start by clarifying the corpus and serving requirements: “Are we indexing user-uploaded legal documents, public law, or both? What is the expected document count, query QPS, latency target, and permission model?” Then declare assumptions, for example: multi-tenant document Q&A, millions of chunks, strict tenant isolation, citations required, and a p95 target under a few seconds excluding long streamed generation.
Organize the answer around four pillars: ingestion/indexing, retrieval/ranking, generation/citation assembly, and production reliability. For ingestion, describe upload APIs, document parsing, chunking, embedding jobs, metadata storage in Postgres, and vector indexing in something like pgvector, OpenSearch, or a managed vector database. For serving, explain query embedding, ACL-filtered hybrid retrieval, optional reranking, prompt construction with chunk IDs, LLM call, and streamed response.
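For the chunking step of ingestion, a sliding window with overlap is one common approach. The sketch below is illustrative only: `chunk_tokens` is a hypothetical helper, the input is assumed to be already tokenized (a whitespace split stands in for a real tokenizer in the usage line), and the defaults of 500 tokens with 75 tokens of overlap are just one point inside the ranges mentioned earlier.

```python
from __future__ import annotations


def chunk_tokens(
    tokens: list[str], chunk_size: int = 500, overlap: int = 75
) -> list[tuple[int, list[str]]]:
    """Split tokens into overlapping windows; returns (start_offset, window) pairs."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: list[tuple[int, list[str]]] = []
    start = 0
    while start < len(tokens):
        chunks.append((start, tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
        start += step
    return chunks


# Usage: whitespace split is only a stand-in for the embedding model's tokenizer.
windows = chunk_tokens("a b c d e f g h i j".split(), chunk_size=4, overlap=1)
```

Keeping the start offset with each window makes it possible to map a chunk back to a page range or character span later, which citations depend on.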
A specific tradeoff to flag is pre-filtering versus post-filtering for permissions. Pre-filtering by tenant and ACL inside the search query is safer and avoids leaking unauthorized chunks into prompts, but it can reduce recall or complicate index design; post-filtering is simpler but dangerous because unauthorized text may reach the model. Close by saying that if you had more time, you would go deeper on index freshness, backfills, evaluation harnesses using golden Q&A sets, and operational dashboards for retrieval misses and citation failures.
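To make the pre-filtering option concrete, here is a small sketch in which the tenant and ACL check runs before any similarity scoring. It is a toy: an exact scan over an in-memory list stands in for a filtered ANN query, and `IndexedChunk`, `search_prefiltered`, and the group-based ACL model are assumptions made for illustration.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedChunk:
    chunk_id: str
    tenant_id: str
    allowed_groups: frozenset[str]  # hypothetical group-based ACL
    embedding: tuple[float, ...]


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search_prefiltered(
    index: list[IndexedChunk],
    query_vec: tuple[float, ...],
    tenant_id: str,
    user_groups: frozenset[str],
    k: int = 10,
) -> list[IndexedChunk]:
    """Apply tenant and ACL filters first, then rank only the visible chunks."""
    visible = [
        c for c in index
        if c.tenant_id == tenant_id and c.allowed_groups & user_groups
    ]
    ranked = sorted(visible, key=lambda c: cosine(query_vec, c.embedding), reverse=True)
    return ranked[:k]
```

Because unauthorized chunks never enter the candidate set, nothing downstream (reranker, prompt, logs) can see them. The recall cost shows up in a real ANN index, where a restrictive filter can leave fewer than k reachable neighbors.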
A second angle
For the prompt “Debug Poor Answer Quality in a RAG System,” the same architecture becomes a diagnostic flow rather than a greenfield design. First separate retrieval failure from generation failure: inspect whether the correct source chunks appeared in the top-k results before blaming the LLM. If the right chunks are absent, investigate parsing quality, chunk boundaries, embedding model changes, metadata filters, lexical search gaps, or stale indexes. If the right chunks are present but the answer is wrong, inspect prompt construction, context ordering, token truncation, citation formatting, and model timeout or streaming behavior. The key difference is that the best answer is evidence-driven: show logs, retrieved chunk IDs, scores, and prompt snapshots rather than making broad claims about “improving the model.”
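The triage described above can be expressed as a small function over a logged trace. This is a sketch with hypothetical names (`Diagnosis`, `diagnose`), and it assumes the answer has already been judged wrong and that a reviewer has labeled which chunk IDs contain the correct evidence.

```python
from __future__ import annotations

from enum import Enum


class Diagnosis(Enum):
    RETRIEVAL_MISS = "retrieval_miss"      # right chunks never reached top-k
    CONTEXT_DROPPED = "context_dropped"    # retrieved, but lost during prompt packing
    GENERATION_ERROR = "generation_error"  # evidence was in the prompt; look at the LLM step


def diagnose(
    expected_chunk_ids: set[str],
    retrieved_chunk_ids: list[str],
    prompt_chunk_ids: list[str],
) -> Diagnosis:
    """Classify a known-bad answer by where the expected evidence was lost."""
    if not expected_chunk_ids & set(retrieved_chunk_ids):
        return Diagnosis.RETRIEVAL_MISS
    if not expected_chunk_ids & set(prompt_chunk_ids):
        return Diagnosis.CONTEXT_DROPPED
    return Diagnosis.GENERATION_ERROR
```

A `RETRIEVAL_MISS` points at parsing, chunking, filters, or index freshness; a `CONTEXT_DROPPED` points at truncation or packing logic; only the last case justifies looking at prompts or model behavior.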
Common pitfalls
Pitfall: Treating RAG as “put documents in a vector DB and ask GPT-4.”
That answer misses the engineering surface area: parsing, versioning, permissions, indexing freshness, reranking, prompt packing, latency, and observability. A better answer names each component and explains the contract between them.
Pitfall: Ignoring access control until the end.
In Harvey-like systems, confidential documents are core data, not incidental data. Say explicitly that ACL and tenant filters must be enforced before retrieved text enters the prompt, and that logs, caches, and traces must avoid leaking sensitive content across tenants.
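One way to keep caches from leaking across tenants or serving stale answers is to fold everything that affects visibility into the cache key. The sketch below is an assumption-laden illustration: `answer_cache_key` and its parameters are hypothetical, and a real system would need to decide which permission attributes belong in the key.

```python
from __future__ import annotations

import hashlib
import json


def answer_cache_key(
    tenant_id: str,
    user_groups: set[str],
    query: str,
    doc_versions: dict[str, str],
    prompt_template_version: str,
) -> str:
    """Derive a cache key that changes whenever tenant, permissions, sources, or prompt change."""
    payload = {
        "tenant": tenant_id,
        "groups": sorted(user_groups),          # order-independent
        "query": " ".join(query.split()),       # normalize whitespace only
        "docs": sorted(doc_versions.items()),   # (doc_id, version_id) pairs
        "template": prompt_template_version,
    }
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```

Hashing the payload also means the key itself carries no readable query text or document identifiers, which helps when keys end up in logs or metrics.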
Pitfall: Over-indexing on ML quality and under-explaining distributed-system behavior.
It is tempting to discuss embedding model choices or hallucination theory at length. For a Software Engineer interview, spend more time on APIs, storage schemas, async indexing, retries, idempotency, backpressure, tail latency, and how to debug production failures.
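Retries and idempotency are easier to discuss with something concrete on the whiteboard. The following is a minimal sketch, assuming a hypothetical `TransientError` raised by a flaky embedding or indexing call and an in-memory `ChunkIndex` standing in for a real store; production code would typically add jitter and a retry budget.

```python
from __future__ import annotations

import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def with_retries(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying TransientError with exponential backoff (no jitter in this sketch)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


class ChunkIndex:
    """Toy store keyed by (doc_id, version, chunk_no) so repeated writes are idempotent."""

    def __init__(self) -> None:
        self._rows: dict[tuple[str, str, int], str] = {}

    def upsert(self, doc_id: str, version: str, chunk_no: int, text: str) -> None:
        self._rows[(doc_id, version, chunk_no)] = text

    def __len__(self) -> int:
        return len(self._rows)
```

Retries are only safe because the write is an upsert on a deterministic key: a job that is retried or delivered twice overwrites the same row instead of creating duplicate chunks.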
Connections
Interviewers may pivot from this topic into search infrastructure, distributed job processing, multi-tenant authorization, LLM serving APIs, or observability design. They may also ask for an evaluation layer, but keep the answer engineering-focused: golden datasets, regression tests, trace inspection, and measurable retrieval/citation failure rates.
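A measurable citation failure rate can be computed directly from answers if citations follow a parseable format. The sketch below assumes the [doc_id:page:chunk_id] format mentioned earlier, with IDs that contain no colons, brackets, or whitespace; `extract_citations` and `citation_failure_rate` are hypothetical helpers, and a case counts as a failure if the answer cites nothing or cites a chunk that was not in its prompt.

```python
from __future__ import annotations

import re
from typing import Iterable

# Matches [doc_id:page:chunk_id]; IDs are assumed to contain no ':', brackets, or spaces.
CITATION_RE = re.compile(r"\[([^\[\]:\s]+):(\d+):([^\[\]:\s]+)\]")

Citation = tuple[str, int, str]


def extract_citations(answer: str) -> list[Citation]:
    """Parse all citation IDs out of an answer string."""
    return [(d, int(p), c) for d, p, c in CITATION_RE.findall(answer)]


def citation_failure_rate(cases: Iterable[tuple[str, set[Citation]]]) -> float:
    """Fraction of answers with no citations or with citations outside the allowed set."""
    failures = 0
    total = 0
    for answer, allowed in cases:
        total += 1
        cited = extract_citations(answer)
        if not cited or any(c not in allowed for c in cited):
            failures += 1
    return failures / total if total else 0.0
```

Run over a golden dataset on every prompt or index change, a number like this can act as a regression gate without requiring anyone to judge answer quality by hand.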
Further reading
-
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — the original RAG paper; useful for understanding the retrieval-plus-generation framing.
-
The Illustrated Word2vec — helpful intuition for embeddings and vector similarity without going deep into model architecture.
-
HNSW: Efficient and Robust Approximate Nearest Neighbor Search — the core ANN algorithm behind many vector search systems.