Explain self-attention, LoRA, Adam vs SGD, ViT
Company: Netflix
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: medium
Interview Round: Technical Screen
Answer the following ML/Deep Learning interview questions:
1) **Describe self-attention** in Transformer models. What are the queries, keys, and values, and how is the attention output computed?
2) **Why are attention logits divided by \(\sqrt{d_k}\)** (where \(d_k\) is the key/query dimension) before the softmax?
3) **Describe LoRA (Low-Rank Adaptation)** for fine-tuning large models. How does it modify the weight update during fine-tuning, and what are its main benefits?
4) **Why does LoRA often reduce GPU memory consumption** compared to full fine-tuning?
5) **What is the difference between Adam and SGD** (including SGD with momentum)? When might you prefer one over the other?
6) **Compare Vision Transformers (ViT) and CNNs**. What are the main pros and cons of each?
7) **What factors influence the choice of ViT patch size** (e.g., 8×8 vs 16×16 vs 32×32), and what are the trade-offs?
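For questions 1 and 2, the mechanics can be made concrete with a minimal single-head sketch in NumPy (the function name and shapes are illustrative, not from any particular library):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X:  (seq_len, d_model) token embeddings
    Wq, Wk, Wv: projections mapping d_model -> d_k (or d_v)
    """
    Q = X @ Wq          # queries: what each token is looking for
    K = X @ Wk          # keys: what each token offers to be matched against
    V = X @ Wv          # values: the content that gets mixed together
    d_k = Q.shape[-1]

    # Divide logits by sqrt(d_k) so their variance stays roughly constant
    # as d_k grows, keeping the softmax out of its saturated,
    # vanishing-gradient regime (question 2).
    logits = (Q @ K.T) / np.sqrt(d_k)

    # Row-wise softmax -> attention weights; each row sums to 1
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    return weights @ V  # (seq_len, d_v): per-token weighted mixture of values

rng = np.random.default_rng(0)
d_model, d_k, seq_len = 16, 8, 5
X = rng.normal(size=(seq_len, d_model))
out = self_attention(X,
                     rng.normal(size=(d_model, d_k)),
                     rng.normal(size=(d_model, d_k)),
                     rng.normal(size=(d_model, d_k)))
print(out.shape)  # (5, 8)
```

A real Transformer layer runs several such heads in parallel and concatenates their outputs, but the Q/K/V roles and the scaling are exactly as above.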
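For questions 3 and 4, a minimal sketch of the LoRA idea (class name and initialization scales are illustrative): the pretrained weight `W` is frozen and only a rank-`r` update `B @ A` is trained, so gradients and optimizer state exist only for the two small matrices.

```python
import numpy as np

class LoRALinear:
    """Minimal LoRA sketch: freeze W, learn a low-rank update B @ A.

    Effective weight: W + (alpha / r) * B @ A, where
      A: (r, d_in)  small random init
      B: (d_out, r) zero init, so the update starts at exactly 0.
    Only A and B are trainable. This is the source of the memory savings
    in question 4: gradients and optimizer state (e.g. Adam's two moment
    buffers) are kept only for the tiny A and B, not for W.
    """
    def __init__(self, W, r=4, alpha=8):
        self.W = W                       # frozen pretrained weight (d_out, d_in)
        d_out, d_in = W.shape
        rng = np.random.default_rng(0)
        self.A = rng.normal(scale=0.01, size=(r, d_in))  # trainable
        self.B = np.zeros((d_out, r))                    # trainable, starts at 0
        self.scale = alpha / r

    def __call__(self, x):
        # (A @ x) then (B @ ...): the full d_out x d_in update is never formed
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

W = np.eye(6)                 # stand-in for a pretrained weight matrix
layer = LoRALinear(W, r=2)
x = np.arange(6.0)
print(np.allclose(layer(x), W @ x))  # True: B = 0, so output is unchanged at init
```

For a `d_out x d_in` layer, full fine-tuning trains `d_out * d_in` parameters, while LoRA trains only `r * (d_in + d_out)`; for large layers with small `r` this is orders of magnitude fewer, and the A/B pair can be merged into `W` after training so inference cost is unchanged.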
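For question 5, the two update rules can be written out directly. The sketch below (hyperparameter values are illustrative defaults) also demonstrates the key behavioral difference: Adam's per-parameter normalization makes its first step roughly `lr` regardless of gradient magnitude, while SGD's step is proportional to the gradient.

```python
import numpy as np

def sgd_momentum_step(w, g, v, lr=0.1, beta=0.9):
    """SGD with momentum: accumulate a velocity, then step along it."""
    v = beta * v + g
    return w - lr * v, v

def adam_step(w, g, m, s, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: per-parameter step size from bias-corrected moment estimates."""
    m = b1 * m + (1 - b1) * g         # EMA of gradients (1st moment)
    s = b2 * s + (1 - b2) * g ** 2    # EMA of squared gradients (2nd moment)
    m_hat = m / (1 - b1 ** t)         # bias correction; t starts at 1
    s_hat = s / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(s_hat) + eps), m, s

w = np.zeros(2)
g = np.array([1e-4, 1e4])             # wildly different gradient scales
w_sgd, _ = sgd_momentum_step(w, g, v=np.zeros(2), lr=0.1)
w_adam, _, _ = adam_step(w, g, m=np.zeros(2), s=np.zeros(2), t=1, lr=0.001)
print(w_sgd)   # steps scale with g: approx [-1e-05, -1e+03]
print(w_adam)  # both steps approx -lr = -0.001, independent of gradient scale
```

This scale invariance is why Adam needs little per-problem learning-rate tuning and dominates for Transformers and sparse gradients, while well-tuned SGD with momentum is cheaper in memory (no second-moment buffer) and often generalizes at least as well on vision tasks.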
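For questions 6 and 7, the patch-size trade-off follows directly from how a ViT tokenizes an image. A minimal patchify sketch (function name is illustrative) makes the arithmetic visible: halving the patch size quadruples the token count, and self-attention cost grows with the square of the token count.

```python
import numpy as np

def patchify(img, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Smaller patches -> more tokens -> finer spatial detail but much more
    attention compute; larger patches -> cheaper but coarser resolution.
    """
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    n_h, n_w = H // patch, W // patch
    x = img.reshape(n_h, patch, n_w, patch, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(n_h * n_w, patch * patch * C)  # (num_tokens, token_dim)

img = np.zeros((224, 224, 3))  # standard ImageNet-style input size
for p in (8, 16, 32):
    tokens = patchify(img, p)
    # attention cost per layer scales with num_tokens ** 2
    print(p, tokens.shape, tokens.shape[0] ** 2)
```

At 224x224, patch 8 yields 784 tokens, patch 16 yields 196, and patch 32 only 49, so the 8x8 model pays roughly 16x the attention cost of the 16x16 model per layer. CNNs avoid this choice entirely: their convolutional inductive bias (locality, translation equivariance) makes them data-efficient at fine resolution, whereas ViTs trade that bias for global receptive fields and better scaling with large datasets.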
Quick Answer: This question set evaluates understanding of modern machine learning and deep learning topics: self-attention mechanics (queries, keys, values, and scaled logits), Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning and its memory savings, optimizer behavior (Adam versus SGD with momentum), and architectural trade-offs between Vision Transformers and CNNs, including patch-size considerations. It is commonly asked because it probes both conceptual understanding and practical application, testing reasoning about training dynamics, model scaling, fine-tuning strategies, and resource/performance trade-offs.