Derive and compare core ML and RL methods
Company: Amazon
Role: Data Scientist
Category: Machine Learning
Difficulty: hard
Interview Round: Technical Screen
Answer the following ML fundamentals rigorously: state assumptions, give equations, and justify trade‑offs.
1) Derive full‑batch gradient descent (GD) and stochastic gradient descent (SGD) updates for L(w) = (1/n) ∑_{i=1..n} ℓ_i(w). Compare convergence, gradient variance, and wall‑clock efficiency; explain when SGD outperforms GD.
2) Define batch size. With n=50,000 samples, epochs=5, batch size b=200, compute update steps per epoch and total. If b increases to 2,000, compute new steps and propose a learning‑rate adjustment via linear scaling; explain when this rule fails.
3) Classify algorithms as supervised vs unsupervised and name one use‑case each: logistic regression, SVM, k‑NN, k‑means, PCA, t‑SNE, Isolation Forest.
4) Relate RL to supervised/unsupervised learning. Write the REINFORCE gradient ∇θ J = E[∑_t ∇θ log πθ(a_t|s_t) G_t] and show how a baseline b_t keeps the estimator unbiased while reducing variance; express the gradient for a length‑3 trajectory with returns G=[3,1,−1] and score‑function terms g1,g2,g3 using a constant baseline b=mean(G).
5) Explain how neural networks are integrated into RL (e.g., DQN, policy gradient, actor‑critic). For DQN, describe why target networks and experience replay stabilize training and a failure mode without them.
6) Contrast Transformers vs RNNs: parallelism, long‑range dependency handling, and complexity. For sequence length n=1024 and model dim d=512, estimate the asymptotic time/memory cost of self‑attention and name two techniques that mitigate quadratic scaling.
7) Define embeddings and polysemy. Propose a method to distinguish “King” in chess vs monarchy contexts using contextual encoders or multi‑sense embeddings; outline an intrinsic (WSD) and extrinsic (downstream accuracy) evaluation.
8) With a single 24‑GB GPU and a 7B model, design a low‑compute fine‑tuning plan (e.g., QLoRA/adapters, 4‑bit quantization, gradient checkpointing, mixed precision). Choose a LoRA rank r and specify batch size, sequence length, optimizer, and learning‑rate schedule. Provide a back‑of‑envelope estimate of trainable parameters using hidden size ≈4096 and ~32 layers; state any assumptions about which projections you adapt.
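The arithmetic asked for in parts 2, 6, and 8 can be checked with a short Python sketch. It is a back-of-envelope aid, not a reference solution, and several inputs are assumptions that the question does not fix: the base learning rate of 0.01 is hypothetical, the attention count covers only the QK^T and scores-times-V products for one layer, and the LoRA estimate assumes rank r=16 applied to two square 4096×4096 projections per layer (for example query and value).

```python
import math


def update_steps(n_samples: int, batch_size: int, epochs: int) -> tuple[int, int]:
    """Return (steps per epoch, total steps); a final partial batch counts as a step."""
    per_epoch = math.ceil(n_samples / batch_size)
    return per_epoch, per_epoch * epochs


def linear_scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear scaling rule: multiply the learning rate by the batch-size ratio."""
    return base_lr * new_batch / base_batch


def attention_cost(seq_len: int, d_model: int) -> tuple[int, int]:
    """Rough per-layer self-attention cost.

    Returns (multiply-adds for QK^T plus scores @ V, score-matrix entries per head).
    Both terms scale as O(n^2); the linear projections are ignored here.
    """
    mult_adds = 2 * seq_len * seq_len * d_model
    score_entries = seq_len * seq_len
    return mult_adds, score_entries


def lora_trainable_params(hidden: int, layers: int, rank: int, adapted_per_layer: int) -> int:
    """Each adapted hidden x hidden projection adds A (rank x hidden) and B (hidden x rank)."""
    return layers * adapted_per_layer * 2 * hidden * rank


# Part 2: n=50,000, 5 epochs, b=200 versus b=2,000.
print(update_steps(50_000, 200, 5))      # (250, 1250)
print(update_steps(50_000, 2_000, 5))    # (25, 125)
# Hypothetical base learning rate of 0.01; the 10x batch increase suggests a 10x rate.
print(linear_scaled_lr(0.01, 200, 2_000))

# Part 6: n=1024, d=512.
print(attention_cost(1024, 512))         # (1073741824, 1048576)

# Part 8: assumed r=16 on two projections per layer, hidden=4096, 32 layers.
print(lora_trainable_params(4096, 32, 16, 2))  # 8388608
```

Under these assumptions the LoRA adapters add roughly 8.4M trainable parameters, a small fraction of the 7B base weights. Adapting more projections or raising the rank scales the count linearly.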
Quick Answer: This question evaluates a candidate's mastery of core machine learning and reinforcement learning concepts, including optimization (gradient methods and batch‑size trade‑offs), supervised versus unsupervised algorithms, policy‑gradient RL and variance reduction, deep RL stability techniques, sequence model complexity (Transformers vs RNNs), embeddings and polysemy, and low‑compute fine‑tuning strategies. It is commonly asked to probe both theoretical understanding and practical engineering judgment about convergence, variance, computational and memory complexity, representation learning, and resource‑constrained model adaptation; the domain is Machine Learning and the assessment spans conceptual understanding and practical application.
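For the variance-reduction part of the question, a small sketch can make the length-3 REINFORCE example concrete. Subtracting a baseline b that does not depend on the action leaves the estimator unbiased because E[∇θ log πθ(a|s)] = ∇θ ∑_a πθ(a|s) = 0. With G=[3,1,−1] the mean baseline is b=1, so the per-step coefficients become [2, 0, −2] and the estimate is 2·g1 + 0·g2 − 2·g3. The numeric score-function vectors below are hypothetical placeholders chosen only to exercise the code, and a baseline computed from the same sampled trajectory is treated as a constant here, as the question does.

```python
def baseline_coefficients(returns: list[float]) -> tuple[float, list[float]]:
    """Return the mean baseline b and the centered coefficients G_t - b."""
    b = sum(returns) / len(returns)
    return b, [g - b for g in returns]


def reinforce_gradient(score_terms: list[list[float]], returns: list[float]) -> list[float]:
    """Single-trajectory estimate sum_t (G_t - b) * g_t with a constant mean baseline.

    score_terms holds one gradient vector g_t per time step.
    """
    _, coeffs = baseline_coefficients(returns)
    dim = len(score_terms[0])
    return [sum(c * g[k] for c, g in zip(coeffs, score_terms)) for k in range(dim)]


returns = [3.0, 1.0, -1.0]
b, coeffs = baseline_coefficients(returns)
print(b, coeffs)  # 1.0 [2.0, 0.0, -2.0]

# Hypothetical 2-dimensional score-function terms g1, g2, g3.
g1, g2, g3 = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
print(reinforce_gradient([g1, g2, g3], returns))  # [0.0, -2.0]
```

Without the baseline the coefficients would be [3, 1, −1]; centering them shrinks their magnitude on average, which is the source of the variance reduction.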