Technical Screen: Transformers, Attention, Decoding, RLHF, Evaluation, and Optimization
Context: Assume a modern decoder-only LLM unless stated otherwise. Address each prompt concisely but precisely, highlighting trade-offs and practical considerations.
Transformer architecture and self-attention
- Explain the Transformer block and how scaled dot-product self-attention works (a minimal sketch follows this list).
- Analyze computational complexity (time/memory) with respect to sequence length n and hidden size d.
- Explain why multi-head attention helps.
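For reference while answering the attention prompts, here is a minimal NumPy sketch of single-head scaled dot-product attention and a naive multi-head split. The weight shapes, random inputs, and function names are illustrative assumptions, not any particular model's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (n, d_k). The (n, n) score matrix is the source of the
    O(n^2 * d) time and O(n^2) activation-memory cost."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # scale by sqrt(d_k) to keep logits well-conditioned
    return softmax(scores, axis=-1) @ V

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (n, d); Wq/Wk/Wv/Wo: (d, d). Each head attends in a d/n_heads subspace."""
    n, d = X.shape
    d_h = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_h, (h + 1) * d_h)
        heads.append(scaled_dot_product_attention(Q[:, sl], K[:, sl], V[:, sl]))
    return np.concatenate(heads, axis=-1) @ Wo  # concatenate heads, then output projection

# Illustrative usage: n = 8 tokens, d = 16 hidden size, 4 heads.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))
Wq, Wk, Wv, Wo = (rng.normal(size=(16, 16)) * 0.1 for _ in range(4))
print(multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads=4).shape)  # (8, 16)
```

The (n, n) score matrix inside `scaled_dot_product_attention` is what the complexity prompt is probing: quadratic in sequence length, linear in head dimension.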
Attention types and use cases
- Differentiate self-attention vs. cross-attention (compare the sketch after this list).
- Define scaled dot-product attention and note alternatives.
- Explain when each attention type is used.
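A minimal sketch contrasting self- and cross-attention, assuming illustrative `decoder_states` and `encoder_states`; the only difference is where the keys and values come from.

```python
import numpy as np

def attention(Q, K, V):
    # scaled dot-product attention; scores have shape (n_q, n_kv)
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
decoder_states = rng.normal(size=(5, 16))  # hypothetical decoder hidden states (n_q = 5)
encoder_states = rng.normal(size=(9, 16))  # hypothetical encoder hidden states (n_kv = 9)

# Self-attention: Q, K, V all come from the same sequence.
self_out = attention(decoder_states, decoder_states, decoder_states)

# Cross-attention: Q from the decoder, K/V from the encoder (as in an encoder-decoder model).
cross_out = attention(decoder_states, encoder_states, encoder_states)

print(self_out.shape, cross_out.shape)  # (5, 16) (5, 16)
```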
Causal (autoregressive) decoding and masks
- Define causal decoding.
- Show how attention masks enforce causality (see the masking sketch after this list).
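A minimal sketch of how a lower-triangular (causal) mask enforces autoregressive decoding, assuming a single head with no projection matrices for brevity.

```python
import numpy as np

def causal_self_attention(X):
    """X: (n, d). Each position may attend only to itself and earlier positions."""
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)                     # (n, n) attention logits
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal = "future" positions
    scores = np.where(mask, -np.inf, scores)          # -inf logits become exactly 0 after softmax
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ X

X = np.random.default_rng(0).normal(size=(4, 8))
out = causal_self_attention(X)
# Row i of the attention weights has zeros in columns j > i, so token i never "sees" the future.
```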
Decoding and sampling strategies
- Compare greedy, temperature, top-k, nucleus (top-p), and beam search (a sampling sketch follows this list).
- Explain trade-offs in quality, diversity, and latency; give practical tuning guidance.
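A minimal sketch of greedy, temperature, top-k, and nucleus (top-p) selection from a single logits vector. Beam search is omitted for brevity, and the function name and default values are illustrative assumptions.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Pick the next token id from a 1-D logits vector using common decoding heuristics."""
    rng = rng or np.random.default_rng()
    if temperature == 0.0:                      # greedy decoding: deterministic argmax
        return int(np.argmax(logits))
    logits = np.asarray(logits, dtype=float) / temperature  # <1 sharpens, >1 flattens
    if top_k is not None:                       # top-k: keep only the k highest logits
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_p is not None:                       # nucleus: smallest prefix with cumulative prob >= top_p
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[: int(np.searchsorted(cum, top_p)) + 1]
        mask = np.zeros(probs.shape, dtype=bool)
        mask[keep] = True
        probs = np.where(mask, probs, 0.0)
        probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])
print(sample_next_token(logits, temperature=0.0))             # greedy -> 0
print(sample_next_token(logits, temperature=0.8, top_k=3))    # stochastic among the top 3
print(sample_next_token(logits, temperature=0.8, top_p=0.9))  # nucleus sampling
```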
RL-based fine-tuning of LLMs
- Describe the RLHF pipeline: preference data, reward modeling, PPO (or alternatives), and KL control (see the objective sketch after this list).
- Discuss common stability challenges and mitigations.
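A minimal sketch of two pieces these prompts hinge on, under simplifying assumptions: a KL-shaped per-token reward (reward-model score minus a beta-weighted divergence from the reference model) and the PPO clipped surrogate objective. Function names and the numbers are illustrative, not any specific library's API.

```python
import numpy as np

def kl_shaped_reward(reward_model_score, logp_policy, logp_ref, beta=0.1):
    """Per-token reward used in many RLHF setups: the KL penalty keeps the
    policy close to the reference (SFT) model at every token, and the
    sequence-level reward-model score is added at the final token."""
    kl_per_token = logp_policy - logp_ref       # sample-based estimate of KL(policy || ref)
    shaped = -beta * kl_per_token
    shaped[-1] += reward_model_score
    return shaped

def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate; clipping bounds how far one update can move the policy."""
    ratio = np.exp(logp_new - logp_old)         # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return np.minimum(unclipped, clipped).mean()  # maximize this (or minimize its negative)

# Illustrative numbers for a 4-token response.
logp_policy = np.array([-1.2, -0.8, -2.0, -0.5])
logp_old    = np.array([-1.1, -0.9, -1.8, -0.6])
logp_ref    = np.array([-1.0, -1.0, -1.5, -0.6])
print(kl_shaped_reward(1.3, logp_policy, logp_ref))
print(ppo_clipped_objective(logp_policy, logp_old, np.array([0.2, -0.1, 0.4, 0.3])))
```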
Evaluation plan for LLMs
- Propose automatic metrics (e.g., perplexity, accuracy), task-based evaluation, human evaluation, and safety/robustness tests (a metric sketch follows this list).
- Explain how to avoid data leakage and ensure statistical significance.
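A minimal sketch of two evaluation building blocks, with made-up numbers: perplexity from per-token log-probabilities, and a percentile-bootstrap confidence interval for a per-example accuracy metric, one common way to back up claims of statistical significance.

```python
import numpy as np

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    return float(np.exp(-np.mean(token_logprobs)))

def bootstrap_ci(per_example_scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a mean metric (e.g., accuracy)."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_example_scores, dtype=float)
    means = np.array([rng.choice(scores, size=scores.size, replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (float(lo), float(hi))

# Illustrative usage with made-up token probabilities and per-example correctness labels.
print(perplexity(np.log([0.25, 0.10, 0.50, 0.05])))
acc, ci = bootstrap_ci([1, 0, 1, 1, 0, 1, 1, 1, 0, 1] * 20)
print(acc, ci)
```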
Optimization techniques for training and inference
- Cover optimizers and LR schedules, mixed precision, gradient checkpointing, parameter-efficient finetuning (e.g., LoRA), and distributed strategies (DP/TP/ZeRO/FSDP); see the LoRA sketch after this list.
- Include key inference optimizations (KV cache, quantization, speculative decoding); see the KV-cache sketch after this list.
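A minimal sketch of the LoRA idea for parameter-efficient finetuning, assuming a single linear layer: the pretrained weight W stays frozen and only a low-rank pair (A, B) is trained. Shapes, initialization, and the alpha/r scaling shown here are illustrative conventions.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """Forward pass through a frozen weight W plus a trainable low-rank update.
    The effective weight is W + (alpha / r) * B @ A, but the full-rank product
    is never materialized; only A (r x d_in) and B (d_out x r) receive gradients."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 16, 4
x = rng.normal(size=(2, d_in))          # batch of 2 activations
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, initialized small
B = np.zeros((d_out, r))                # trainable, zero-initialized so the update starts as a no-op
print(lora_forward(x, W, A, B).shape)   # (2, 16)
# Trainable parameters for this layer: r * (d_in + d_out) = 128 vs. d_in * d_out = 256 for full finetuning.
```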
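A minimal sketch of KV caching during autoregressive inference, assuming a single head; attending only over cached (past) positions is what keeps the step causal. Names and shapes are illustrative.

```python
import numpy as np

def decode_step_with_kv_cache(x_t, Wq, Wk, Wv, k_cache, v_cache):
    """One autoregressive step: project only the new token, append its K/V to the
    cache, and attend over the cached prefix instead of recomputing K/V for all
    earlier tokens, so each step scales with the prefix length rather than its square."""
    q = x_t @ Wq                                          # (d,)
    k_cache = np.vstack([k_cache, (x_t @ Wk)[None, :]])   # (t+1, d)
    v_cache = np.vstack([v_cache, (x_t @ Wv)[None, :]])
    scores = k_cache @ q / np.sqrt(q.size)                # (t+1,)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v_cache, k_cache, v_cache

rng = np.random.default_rng(0)
d = 16
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
k_cache, v_cache = np.zeros((0, d)), np.zeros((0, d))
for x_t in rng.normal(size=(5, d)):                       # 5 illustrative decoding steps
    out, k_cache, v_cache = decode_step_with_kv_cache(x_t, Wq, Wk, Wv, k_cache, v_cache)
print(out.shape, k_cache.shape)                           # (16,) (5, 16)
```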