RAG Systems And LLM Retrieval

What's being tested

Retrieval-Augmented Generation (RAG) design tests whether you can build an LLM-backed product that is grounded in private, changing knowledge rather than only in model parameters. For a Software Engineer, the interviewer is probing system decomposition, retrieval algorithms, access control, latency/cost tradeoffs, failure modes, and how you would debug incorrect answers in production. OpenAI cares because enterprise assistants must be useful, secure, observable, and resilient under real-world constraints like stale documents, permission boundaries, long-tail queries, and high `p99` latency pressure.

Core knowledge

RAG architecture usually has four runtime stages: query understanding, retrieval, context construction, and generation. A strong design separates the retriever, optional reranker, policy/filtering layer, and LLM gateway so each can be scaled, monitored, and swapped independently.
Document ingestion for retrieval means converting source docs into normalized text, metadata, chunks, and embeddings. For SWE interviews, focus on contracts and correctness: stable document IDs, versioning, deletion propagation, permission metadata, and idempotent re-indexing rather than low-level ETL orchestration.
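
The ingestion contract above (stable IDs, versioning, deletion propagation, idempotent re-indexing) can be sketched with a toy in-memory store. This is only an illustration under assumptions: `ChunkIndex`, its method names, and the idea that every change event carries a monotonically increasing integer version are hypothetical, not taken from any particular vector database.

```python
class ChunkIndex:
    """Toy in-memory index keyed by stable doc_id (hypothetical interface)."""

    def __init__(self):
        self._versions = {}  # doc_id -> highest version seen (upsert or delete)
        self._chunks = {}    # doc_id -> chunks of the live version

    def upsert(self, doc_id, version, chunks):
        current = self._versions.get(doc_id)
        if current is not None and version <= current:
            return False  # duplicate or stale event: replaying it changes nothing
        self._versions[doc_id] = version
        self._chunks[doc_id] = list(chunks)  # replace all chunks of the old version
        return True

    def delete(self, doc_id, version):
        current = self._versions.get(doc_id)
        if current is not None and version <= current:
            return False
        self._versions[doc_id] = version  # tombstone: blocks late-arriving old upserts
        self._chunks.pop(doc_id, None)
        return True

    def chunks_for(self, doc_id):
        return list(self._chunks.get(doc_id, []))


index = ChunkIndex()
first = index.upsert("doc-1", 1, ["intro", "setup"])
replayed = index.upsert("doc-1", 1, ["intro", "setup"])  # same event delivered twice
deleted = index.delete("doc-1", 2)
late = index.upsert("doc-1", 1, ["intro", "setup"])      # out-of-order replay
```

Keeping the version after a delete is the design choice worth mentioning in an interview: without the tombstone, a delayed upsert could resurrect a revoked document.
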
Chunking strategy is a major quality lever. Fixed windows of 300–800 tokens with 10–20% overlap are simple; semantic chunking by headings, paragraphs, or code blocks often improves precision. Too-small chunks lose context; too-large chunks waste context window and dilute embedding similarity.
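
A minimal sketch of the fixed-window option, assuming the document is already tokenized. The function name `chunk_tokens` and the choice of a 400-token window with a 60-token (15%) overlap are illustrative picks from the ranges above, not recommended settings.

```python
def chunk_tokens(tokens, window=400, overlap=60):
    """Split a token list into fixed windows that share `overlap` tokens."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Synthetic tokens stand in for the output of a real tokenizer.
tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
```

Semantic chunking would replace the fixed `step` with boundaries taken from headings, paragraphs, or code blocks, while keeping the same output shape.
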
Embeddings map text to vectors where semantic similarity is approximated by cosine similarity or dot product. For unit-normalized vectors the two coincide: $\cos(a,b)=a \cdot b$. Store vectors with metadata such as `doc_id`, `section_id`, `owner_group`, `updated_at`, and `source_uri`.
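
The identity for normalized vectors is easy to check numerically. The sketch below uses plain Python lists so it runs without dependencies; the helper names are illustrative, and a production system would use a vectorized library instead.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
cos_ab = cosine(a, b)                        # 24 / 25
dot_unit = dot(normalize(a), normalize(b))   # same value, up to float rounding
```

This is why many pipelines normalize embeddings once at ingestion time: the index can then use the cheaper dot product at query time.
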
Approximate nearest neighbor (ANN) indexes make vector search practical. `HNSW` gives strong recall/latency for millions of vectors with memory overhead; `IVF-PQ` in `FAISS` compresses better for tens or hundreds of millions but can reduce recall. Brute-force search is acceptable only for small corpora or offline evaluation.
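Hybrid retrieval combines dense vectors with lexical search such as `BM25`. Dense search handles paraphrases; lexical search handles exact identifiers like `SOC2`, `Q4-2024`, API names, or error codes. A common approach retrieves the top candidates from each retriever and merges them via reciprocal rank fusion: $\mathrm{score}(d)=\sum_i \frac{1}{k + \mathrm{rank}_i(d)}$, where $i$ ranges over the retrievers, $\mathrm{rank}_i(d)$ is the 1-based position of $d$ in retriever $i$'s list, and $k$ is a smoothing constant (not the retrieval depth).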
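
Reciprocal rank fusion is short enough to sketch directly. The example assumes each retriever returns an ordered list of document IDs; `k=60` is a commonly used default rather than something the text above prescribes, and documents missing from a list simply contribute nothing for that retriever.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked ID lists; a higher fused score means a better rank."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Break score ties by ID so the output order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["d1", "d2", "d3"]    # hypothetical dense-retriever output
lexical = ["d3", "d1", "d4"]  # hypothetical BM25 output
fused = reciprocal_rank_fusion([dense, lexical])
# fused == ["d1", "d3", "d2", "d4"]
```

Because the formula uses only ranks, it sidesteps the problem that `BM25` scores and cosine similarities live on incomparable scales.
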
Reranking improves precision after broad retrieval. Retrieve 50–200 candidates cheaply, then use a cross-encoder or LLM-based scorer to select the top 5–20 passages. This adds latency and cost, so gate it by query type, cache results, or use a lighter reranker for interactive paths.
Context assembly is not just “paste top chunks into the prompt.” Deduplicate adjacent chunks, preserve source titles and timestamps, fit a token budget, and include citations. Prefer diversity across documents when evidence is scattered; prefer contiguous expansion when one document contains the answer.
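
One way to picture context assembly is a greedy packer over scored chunks. Everything here is an assumption made for illustration: the `Chunk` fields, the whitespace-based `count_tokens` stand-in for a real tokenizer, and the `[n]` citation format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    chunk_index: int
    title: str
    updated_at: str
    text: str
    score: float


def count_tokens(text):
    # Crude stand-in for a real tokenizer: whitespace-separated words.
    return len(text.split())


def assemble_context(chunks, token_budget):
    selected, seen, used = [], set(), 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        key = (chunk.doc_id, chunk.chunk_index)
        if key in seen:
            continue  # the same chunk came back from more than one retriever
        cost = count_tokens(chunk.text)
        if used + cost > token_budget:
            continue  # does not fit; a smaller, lower-ranked chunk still might
        seen.add(key)
        selected.append(chunk)
        used += cost
    # Restore reading order so adjacent chunks of one document stay contiguous.
    selected.sort(key=lambda c: (c.doc_id, c.chunk_index))
    blocks = [
        f"[{n}] {c.title} (updated {c.updated_at})\n{c.text}"
        for n, c in enumerate(selected, start=1)
    ]
    return "\n\n".join(blocks), selected


candidates = [
    Chunk("A", 0, "Guide A", "2024-05-01", "alpha beta gamma", 0.9),
    Chunk("A", 0, "Guide A", "2024-05-01", "alpha beta gamma", 0.8),  # duplicate
    Chunk("B", 0, "Guide B", "2023-01-10", "one two three four five six", 0.7),
    Chunk("A", 1, "Guide A", "2024-05-01", "delta epsilon", 0.5),
]
prompt_context, used_chunks = assemble_context(candidates, token_budget=6)
```

The numbered headers give the model something concrete to cite, and the returned `used_chunks` list lets the caller map each citation back to a source and revision.
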
Authorization must happen before generation, not after. Apply ACL filtering at retrieval time using user identity, groups, document labels, and tenant boundaries. Never retrieve unauthorized chunks and rely on the model to ignore them; that creates leakage risk through summaries, citations, or prompt injection.
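
The rule can be sketched as a filter that runs inside the retriever, so unauthorized chunks never reach prompt construction. The `User` and `IndexedChunk` types and the tenant-plus-group check are hypothetical simplifications; a real deployment would usually push the same predicate into the index query as a metadata filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    user_id: str
    tenant_id: str
    groups: frozenset


@dataclass(frozen=True)
class IndexedChunk:
    chunk_id: str
    tenant_id: str
    allowed_groups: frozenset
    text: str


def is_authorized(user, chunk):
    # Tenant boundary first, then at least one shared group.
    return chunk.tenant_id == user.tenant_id and bool(user.groups & chunk.allowed_groups)


def retrieve_authorized(user, ranked_candidates, limit):
    """Drop unauthorized chunks before anything is handed to the LLM layer."""
    allowed = [c for c in ranked_candidates if is_authorized(user, c)]
    return allowed[:limit]


alice = User("alice", "t1", frozenset({"eng"}))
ranked = [
    IndexedChunk("c1", "t1", frozenset({"eng"}), "deploy runbook"),
    IndexedChunk("c2", "t1", frozenset({"hr"}), "salary bands"),
    IndexedChunk("c3", "t2", frozenset({"eng"}), "other tenant's runbook"),
]
visible = retrieve_authorized(alice, ranked, limit=5)
```

Applying `limit` after the filter matters: truncating first and filtering second can leave the user with too few results even though authorized matches exist further down the list.
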
Freshness and deletions are product-critical. An enterprise assistant must reflect document updates and removals within a defined SLA, for example “new docs searchable within 5 minutes, revoked docs unavailable within 60 seconds.” Use document version metadata so citations point to the exact retrieved revision.
Latency and cost budgets should be explicit. A plausible target might be `p50 < 2s`, `p95 < 6s`, and strict timeout fallbacks: lexical-only retrieval, fewer reranked candidates, streaming generation, or “I found relevant docs but need more time.” Track model tokens because context bloat drives both cost and latency.
Evaluation and observability need both system and answer-quality signals. Log retrieval hit rate, top-k recall on golden queries, citation coverage, refusal rate, hallucination reports, `p95` latency per stage, cache hit rate, and cost per answer. Debugging usually starts by asking: did retrieval fail, reranking fail, or generation ignore evidence?
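
Top-k recall on golden queries is simple to compute offline. The sketch assumes a golden set that maps each query to the chunk IDs a correct answer needs, and a run that maps each query to the retriever's ranked output; both dictionaries below are made-up examples.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunk IDs that appear in the top k results."""
    if not relevant:
        raise ValueError("each golden query needs at least one relevant chunk")
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def mean_recall_at_k(golden, run, k):
    return sum(recall_at_k(run[q], rel, k) for q, rel in golden.items()) / len(golden)


golden = {"q1": {"c1", "c2"}, "q2": {"c9"}}
run = {"q1": ["c1", "c7", "c2"], "q2": ["c3", "c4", "c9"]}

recall_2 = mean_recall_at_k(golden, run, k=2)  # (0.5 + 0.0) / 2 = 0.25
recall_3 = mean_recall_at_k(golden, run, k=3)  # (1.0 + 1.0) / 2 = 1.0
```

A large gap between recall at the retrieval depth and recall at the final context size suggests the reranker or context assembly, rather than the retriever, is dropping the evidence.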

Tip: In design interviews, name the failure boundary: “If the answer is wrong, I want traces showing query, candidate chunks, ACL decisions, reranker scores, prompt, citations, and model output.”

Worked example

For the prompt “Design an enterprise RAG assistant for internal docs,” a strong candidate would start by clarifying the corpus size, document sources, access-control model, freshness requirements, expected query volume, latency target, and whether answers require citations. They might state assumptions: “Assume 10 million chunks across `Google Drive`, `Confluence`, and code docs; users authenticate through SSO; permissions are group-based; and answers must cite sources.” The answer can then be organized around five pillars: indexing pipeline, retrieval/reranking path, generation and citation strategy, security/permissions, and production operations.

For indexing, describe parsing documents, chunking by structure, embedding chunks, and storing vectors plus metadata in a vector index such as `FAISS`, `Milvus`, `Pinecone`, or `pgvector` depending on scale. For query serving, embed the user query, apply ACL filters, perform hybrid retrieval, rerank candidates, construct a compact context, and call the LLM with instructions to answer only from provided evidence. A key tradeoff to flag is filtering before versus after ANN search: pre-filtering by ACL is safer but may reduce ANN performance if filters are highly selective; post-filtering is faster in some indexes but risks poor recall and must never expose filtered content to the model. Close by explaining how you would monitor and iterate: golden query sets, stage-level tracing, latency dashboards, user feedback, and red-team tests for prompt injection and permission leakage. If you had more time, add multi-hop retrieval, query rewriting, table-aware parsing, and offline evaluation for retrieval recall.

A second angle

A closely related variant is designing a RAG assistant over customer-support tickets, product documentation, and incident runbooks. The same core architecture applies, but the constraints shift toward freshness, exact error-message matching, and escalation behavior when the assistant lacks confidence. Hybrid retrieval becomes more important because users often paste stack traces, ticket IDs, or version numbers that dense embeddings may blur. The system also needs stronger source ranking rules: official docs and current runbooks should outrank old tickets unless the query explicitly asks for historical examples. The best answer still decomposes retrieval, filtering, context construction, and generation, but emphasizes operational correctness over broad enterprise knowledge coverage.

Common pitfalls

Pitfall: Treating RAG as “put documents in a vector database and call an LLM.”

That answer is too shallow for a system design interview. A better answer explicitly covers chunking, metadata, ACLs, hybrid retrieval, reranking, context budgeting, citations, fallbacks, and observability; the vector database is only one component.

Pitfall: Ignoring security until the end.

Enterprise assistants are mostly defined by permissions, tenant isolation, auditability, and data leakage prevention. Say early that retrieval must be scoped by user identity and document ACLs, and that logs, prompts, caches, and citations also need access-control treatment.

Pitfall: Over-indexing on model quality instead of system behavior.

It is tempting to discuss which LLM is “best” or how to train a retriever in depth, but a SWE interviewer wants architecture and tradeoffs. Keep model selection lightweight and spend more time on interfaces, scaling bottlenecks, latency, correctness, debugging, and failure isolation.

Connections

Interviewers may pivot from this topic into vector search internals, distributed system design, caching and rate limiting, LLM safety, or evaluation of search quality. Be ready to discuss `HNSW`, `BM25`, access-control enforcement, prompt injection defenses, and how to debug a bad answer from logs.
