LLM Product Understanding, RAG, And Fine-Tuning — Tech Interview Concept

What's being tested
Ability to design and evaluate LLM-powered features that combine retrieval and model adaptation: choosing retrieval architecture, embedding/index tradeoffs, fine-tuning strategy, latency/cost constraints, and evaluation metrics to reduce hallucination and improve relevance.
Core knowledge

Retrieval-augmented generation: retrieve-then-generate pipeline, grounding vs. context-window limits.
Embeddings & vector DBs: dense (DPR/SBERT) vs sparse (BM25); FAISS, Milvus, Pinecone tradeoffs.
Reranking: cross-encoder rerankers boost precision@k at the cost of extra latency and compute per query.
Chunking: chunk size, overlap, and passage-level vs. doc-level retrieval all affect recall.
Fine-tuning methods: SFT, instruction tuning, LoRA/QLoRA, adapters, full-parameter FT tradeoffs.
Evaluation metrics: retrieval precision@k, recall, MAP, generation ROUGE/F1, hallucination rate, latency, cost-per-query.
Safety/freshness: PII filtering, metadata filtering, incremental index updates, staleness mitigation.
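
The retrieval metrics listed above can be made concrete with a small sketch. It assumes each query has a ranked list of retrieved document ids and a set of labeled relevant ids; the function names and sample ids are illustrative, not taken from any particular library. MAP is the mean of `average_precision` over all evaluation queries.

```python
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled by relevant documents.

    Divides by k even when fewer than k results come back, so a short
    result list is penalized (one common convention, assumed here).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def average_precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Average of precision values at each rank where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    seen = set()
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant and doc_id not in seen:
            seen.add(doc_id)
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Hypothetical labeled query: three relevant docs, one never retrieved.
retrieved_ids = ["d3", "d1", "d7", "d2"]
relevant_ids = {"d1", "d2", "d9"}
print(precision_at_k(retrieved_ids, relevant_ids, k=3))
print(recall_at_k(retrieved_ids, relevant_ids, k=4))
print(average_precision(retrieved_ids, relevant_ids))
```

Tracking recall@k alongside precision@k matters in RAG because the generator cannot ground an answer in a passage that never reached the context window.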

Worked example — "Design a RAG system for customer support"
Frame it by stakeholder goals: reduce average handle time and decrease incorrect answers (hallucinations), measured by resolution rate and hallucination rate. Decompose into components: knowledge sources (KB, tickets, product docs), document preprocessing (chunk size, metadata), embedding choice (SBERT for semantic search; evaluate OpenAI embeddings against open-weight models), index selection (FAISS for on-premise or Pinecone for managed), retrieval pipeline (dense retrieval + BM25 hybrid), reranker (cross-encoder on top-k), and generator (base LLM with grounding prompts). Define evaluation: offline retrieval recall@k, end-to-end accuracy on labeled queries, latency SLA, and an A/B test for business metrics (CSAT, handle time). Plan iteration: start with retrieval tuning before any fine-tuning; use SFT or LoRA only if the generator still errs consistently when the correct passages are in its context.
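
The note does not say how the dense and BM25 result lists are merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than a prescribed design. The function name and document ids are hypothetical, and `k=60` is a frequently used smoothing constant that should be tuned per corpus.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked id lists into one using RRF.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so documents ranked well by multiple retrievers rise to the top.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from each retriever for one support query.
dense_hits = ["a", "b", "c"]
bm25_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)  # ['b', 'a', 'd', 'c']
```

RRF uses ranks only, so it avoids having to normalize BM25 scores against cosine similarities. The fused top-k is what would then be passed to the cross-encoder reranker.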
A common pitfall
Treating fine-tuning as the primary fix for hallucinations. It is tempting to retrain the model to "know" facts, but most errors stem from retrieval failures or prompt design. Fine-tuning increases cost, adds maintenance burden and drift, and often masks mis-indexing or stale data. Prioritize index quality, retrieval architecture, and grounding checks before heavy parameter tuning.
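
One way to check where errors actually come from before reaching for fine-tuning is to triage a labeled evaluation set. This is a minimal sketch under assumptions: the `EvalRecord` fields, the label names, and the cutoff `k=5` are all hypothetical, and it presumes each query has gold document ids plus a human or automated correctness judgment.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class EvalRecord:
    """One labeled query from a hypothetical offline evaluation set."""
    query: str
    gold_doc_ids: Set[str]        # documents that contain the answer
    retrieved_doc_ids: List[str]  # ranked ids returned by the retriever
    answer_correct: bool          # judged correctness of the final answer


def classify(record: EvalRecord, k: int = 5) -> str:
    """Label a record as correct, a retrieval miss, or a generation error."""
    if record.answer_correct:
        return "correct"
    top_k = set(record.retrieved_doc_ids[:k])
    if record.gold_doc_ids & top_k:
        # The right passage reached the prompt, yet the answer was wrong.
        return "generation_error"
    # No gold passage in the context: the generator had nothing to ground on.
    return "retrieval_miss"


def triage(records: Iterable[EvalRecord], k: int = 5) -> Counter:
    """Count outcomes across the evaluation set."""
    return Counter(classify(record, k) for record in records)


records = [
    EvalRecord("reset password", {"kb1"}, ["kb1", "kb4"], True),
    EvalRecord("refund window", {"kb2"}, ["kb7", "kb8"], False),
    EvalRecord("export data", {"kb3"}, ["kb3", "kb9"], False),
]
print(triage(records))
```

If `retrieval_miss` dominates, the work belongs in chunking, indexing, or the retriever. Only a large, persistent `generation_error` share points toward prompt changes or SFT/LoRA.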
Further reading

Lewis, P., et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (2020).
Hu, E. J., et al., "LoRA: Low-Rank Adaptation of Large Language Models" (2021).

Related concepts