LLM Fundamentals — Onsite Interview Task
Context: Assume a modern transformer-based LLM. Provide precise, concise explanations with examples and trade-offs.
- Subword tokenization (e.g., BPE): How does it work and why is it used?
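A minimal sketch of the BPE training loop, assuming a word-frequency dict as input (the `bpe_train` helper is illustrative; production tokenizers add byte-level fallback, pre-tokenization, and special tokens):

```python
from collections import Counter

def bpe_train(word_freqs, num_merges):
    """Toy BPE trainer: words start as character tuples; repeatedly merge
    the most frequent adjacent symbol pair into a new token."""
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for syms, freq in vocab.items():
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair becomes one symbol
        merges.append(best)
        new_vocab = {}
        for syms, freq in vocab.items():
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                    out.append(syms[i] + syms[i + 1])   # apply the merge
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

print(bpe_train({"lower": 5, "lowest": 2, "newer": 6, "wider": 3}, 8))
# e.g. [('e', 'r'), ('er', '_')...] — frequent fragments become single tokens
```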
- Self-attention: Explain the mechanism and its O(n^2) cost in sequence length. Discuss techniques that reduce it (e.g., sparse attention, sliding windows) and how KV caching avoids recomputing past keys and values during decoding.
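A sketch of scaled dot-product attention with an optional causal sliding-window mask as one sparsity example (the `attention` signature and `window` parameter are illustrative, not from any particular library):

```python
import numpy as np

def attention(Q, K, V, window=None):
    """Scaled dot-product attention; the score matrix is (n, n), hence O(n^2).
    `window` applies a causal sliding-window mask as one sparsity example."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)              # (n, n): the quadratic term
    if window is not None:
        i = np.arange(n)[:, None]
        j = np.arange(n)[None, :]
        # Keep only keys at or within `window` positions behind the query.
        mask = (j > i) | (j < i - window + 1)
        scores = np.where(mask, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V                               # (n, d)

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = attention(Q, K, V, window=3)  # each token attends to at most 3 positions
```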
- Contrast pretraining, instruction tuning, and RLHF/DPO.
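One way to make the RLHF/DPO contrast concrete: a sketch of the DPO loss for a single preference pair, assuming summed token log-probabilities under the policy and the frozen reference (SFT) model are precomputed:

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one (chosen, rejected) pair.
    logp_* are summed token log-probs of a response under the policy;
    ref_logp_* are the same under the frozen reference model."""
    # Implicit reward = beta * log-ratio between policy and reference;
    # the loss pushes the chosen/rejected reward margin apart.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))   # -log(sigmoid(margin))

# Policy slightly prefers the chosen response relative to the reference,
# slightly disprefers the rejected one -> modest loss (~0.6 here).
print(dpo_loss(logp_w=-12.0, logp_l=-15.0, ref_logp_w=-13.0, ref_logp_l=-14.0))
```

Unlike RLHF with PPO, there is no separate reward model or sampling loop at training time; the preference signal is folded directly into this supervised objective.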
- Describe a RAG architecture. Compare indexing choices (BM25 vs. dense retrieval), chunking strategies, and embedding models. Explain how retrieval quality affects generation.
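A toy retrieve-then-generate pipeline. The `embed` function here is a deterministic placeholder (a real system calls an embedding model), so the snippet shows the pipeline shape, not real semantic ranking:

```python
import numpy as np

DIM = 64

def embed(text):
    """Stand-in for an embedding model: deterministic random unit vector."""
    rng = np.random.default_rng(sum(text.encode()))
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

chunks = [
    "BM25 scores chunks by term overlap, weighted by term rarity.",
    "Dense retrieval embeds queries and chunks into one vector space.",
    "Smaller chunks improve precision but can lose surrounding context.",
]
index = np.stack([embed(c) for c in chunks])      # (num_chunks, DIM)

def retrieve(query, k=2):
    """Cosine similarity = dot product of unit vectors; return top-k chunks."""
    sims = index @ embed(query)
    top = np.argsort(-sims)[:k]
    return [chunks[i] for i in top]

question = "How does dense retrieval work?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQ: {question}"
# `prompt` would then be sent to the generator LLM; if retrieval misses the
# relevant chunk, the generator has nothing correct to ground its answer in.
```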
- When do you use prompting vs. fine-tuning vs. adapters (e.g., LoRA)?
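A sketch of the LoRA idea, assuming a frozen weight matrix W with a trainable low-rank update B @ A added on top (class and parameter names are illustrative):

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A (rank r << d)."""
    def __init__(self, W, r=4, alpha=8):
        d_out, d_in = W.shape
        self.W = W                        # frozen pretrained weight
        self.A = np.random.default_rng(0).standard_normal((r, d_in)) * 0.01
        self.B = np.zeros((d_out, r))     # zero init: layer starts identical to W
        self.scale = alpha / r

    def __call__(self, x):
        # During training only A and B receive gradients; W stays fixed.
        return x @ self.W.T + (x @ self.A.T) @ self.B.T * self.scale

W = np.random.default_rng(1).standard_normal((16, 16))
layer = LoRALinear(W, r=4)
y = layer(np.ones((2, 16)))
# Trainable params: r*(d_in + d_out) = 128 here, vs. d_in*d_out = 256 for
# full fine-tuning of this layer; the gap widens sharply at real model sizes.
```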
- Low-latency inference: Explain quantization, KV caching, and batching.
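Two of these in miniature: absmax int8 quantization of a weight matrix, and a toy decode loop where cached keys/values make each step O(t) instead of re-running attention over the whole prefix (the projections are placeholders for learned Wq/Wk/Wv):

```python
import numpy as np

# -- Quantization: absmax int8 round-trip for one weight matrix -------------
def quantize_int8(W):
    scale = np.abs(W).max() / 127.0
    return np.round(W / scale).astype(np.int8), scale

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
Wq, s = quantize_int8(W)
W_hat = Wq.astype(np.float32) * s   # dequantize; small rounding error remains

# -- KV cache: reuse past keys/values during autoregressive decoding --------
def attend(q, K, V):
    """Single-query attention over all cached keys/values."""
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max()); w /= w.sum()
    return w @ V

d = 8
K_cache, V_cache = [], []           # grows by one entry per generated token
x = rng.standard_normal(d)          # hidden state of the current token
for t in range(5):
    q = k = v = x                   # placeholder projections
    K_cache.append(k); V_cache.append(v)
    # Reuse cached K/V: O(t) work per step, no recompute over the prefix.
    out = attend(q, np.stack(K_cache), np.stack(V_cache))
    x = rng.standard_normal(d)      # stand-in for the next token's hidden state
```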
- How do you evaluate LLMs (task-specific metrics, human eval) and mitigate hallucinations and safety risks?
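For the task-specific-metrics part, a sketch of SQuAD-style exact match and token-level F1 (the normalization here is a simplification of the official script); hallucination and safety mitigation need grounding, citation checks, and human review beyond string metrics:

```python
import re
from collections import Counter

def normalize(s):
    """Lowercase, strip punctuation and extra whitespace."""
    return " ".join(re.sub(r"[^\w\s]", "", s.lower()).split())

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """Harmonic mean of token precision and recall against the reference."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris.", "paris"))                   # 1.0
print(round(token_f1("in Paris, France", "Paris"), 2))  # 0.5
```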