ML Fundamentals Technical Screen — Multi‑part Question
Context: You are given a set of core machine learning topics to address rigorously. For each part, state assumptions, give equations, reason about trade‑offs, and compute requested quantities.
-
Gradient methods
-
Given an empirical risk L(w) = (1/n) ∑_{i=1..n} ℓ_i(w), derive the update rules for:
a) Full‑batch gradient descent (GD)
b) Stochastic gradient descent (SGD) and mini‑batch SGD
-
Compare convergence properties, gradient variance, and wall‑clock efficiency. Explain when SGD outperforms GD.
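The update rules above can be sketched concretely. This is a minimal NumPy illustration on a synthetic least-squares problem (the data, learning rates, and step counts are illustrative assumptions, not part of the question): GD uses the full-batch gradient each step; mini-batch SGD shuffles once per epoch and steps on slices, with b = 1 recovering plain SGD.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def grad(w, idx):
    # Gradient of the mean squared error over the rows in idx.
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

def gd(lr=0.1, steps=200):
    w = np.zeros(d)
    for _ in range(steps):
        w -= lr * grad(w, np.arange(n))      # full-batch gradient each update
    return w

def sgd(lr=0.05, epochs=5, b=50):
    w = np.zeros(d)
    for _ in range(epochs):
        perm = rng.permutation(n)            # reshuffle once per epoch
        for start in range(0, n, b):
            idx = perm[start:start + b]      # mini-batch; b = 1 gives plain SGD
            w -= lr * grad(w, idx)
    return w
```

Both reach the neighborhood of w_true here; the contrast to discuss is that SGD takes many cheap, noisy steps per epoch while GD takes one exact, expensive step.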
-
Batch size and steps
-
Define batch size.
-
With n = 50,000 samples, epochs = 5:
a) For batch size b = 200, compute updates per epoch and total updates.
b) For b = 2,000, compute the new updates per epoch and total updates, and propose a learning‑rate adjustment via the linear scaling rule. Explain when this rule fails or needs modification.
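The expected arithmetic can be checked in a few lines (the base learning rate 0.1 is an illustrative assumption; the linear scaling rule multiplies it by the batch-size ratio and is usually paired with warmup):

```python
def total_updates(n, epochs, b):
    steps_per_epoch = n // b           # assumes b divides n; otherwise use ceil
    return steps_per_epoch, steps_per_epoch * epochs

print(total_updates(50_000, 5, 200))    # (250, 1250)
print(total_updates(50_000, 5, 2_000))  # (25, 125)

# Linear scaling rule: batch grows by k = 2000/200 = 10, so scale lr by 10.
base_lr, k = 0.1, 2_000 / 200
scaled_lr = base_lr * k                 # 1.0 under the assumed base lr
```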
-
Supervised vs. unsupervised
-
Classify each algorithm and give one use case:
logistic regression, SVM, k‑NN, k‑means, PCA, t‑SNE, Isolation Forest.
-
Reinforcement learning and policy gradients
-
Relate RL to supervised and unsupervised learning.
-
Write the REINFORCE gradient ∇θ J = E[∑_t ∇θ log πθ(a_t|s_t) G_t]. Show how subtracting a baseline b_t that does not depend on the action a_t keeps the estimator unbiased while reducing variance.
-
For a length‑3 trajectory with returns G = [3, 1, −1] and score‑function terms g1, g2, g3, use a constant baseline b = mean(G) to express the sample gradient.
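With b = mean(G) = 1, the sample gradient is (3−1)g₁ + (1−1)g₂ + (−1−1)g₃ = 2g₁ − 2g₃. A quick check, using hypothetical 2‑dimensional score-function vectors (the g values below are invented for illustration only):

```python
import numpy as np

G = np.array([3.0, 1.0, -1.0])          # per-step returns
b = G.mean()                             # constant baseline, b = 1
# Hypothetical score-function terms g_t = ∇θ log πθ(a_t|s_t) for a 2-D θ.
g = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

grad = ((G - b)[:, None] * g).sum(axis=0)   # Σ_t (G_t − b) g_t = 2 g1 − 2 g3
```

Note the middle term vanishes entirely: advantages of zero contribute nothing, which is exactly the variance-reduction effect.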
-
Deep RL integrations
-
Explain how neural networks are used in RL (e.g., DQN, policy gradient, actor‑critic).
-
For DQN, describe why target networks and experience replay stabilize training, and name a failure mode without them.
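The two stabilizers can be sketched in a few dozen lines. This is a simplified illustration, not a full DQN: the Q-function is a hypothetical linear model rather than a deep network, and hyperparameters are assumptions. The replay buffer decorrelates updates by sampling uniformly over past transitions; the frozen target network supplies the bootstrap value so the regression target does not chase the online weights.

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Uniform experience replay: breaks temporal correlation between updates."""
    def __init__(self, capacity=10_000):
        self.buf = deque(maxlen=capacity)

    def push(self, transition):              # transition = (s, a, r, s_next, done)
        self.buf.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)

class LinearQ:
    """Hypothetical linear Q-function, Q(s, ·) = W @ s, standing in for a network."""
    def __init__(self, n_states, n_actions, rng):
        self.W = rng.normal(scale=0.1, size=(n_actions, n_states))

    def q(self, s):
        return self.W @ s

def dqn_update(online, target, batch, lr=0.01, gamma=0.99):
    # Semi-gradient step toward the frozen target network's bootstrap value.
    for s, a, r, s_next, done in batch:
        bootstrap = 0.0 if done else gamma * target.q(s_next).max()
        td_error = r + bootstrap - online.q(s)[a]
        online.W[a] += lr * td_error * s     # ∂Q/∂W for the linear model

def sync(online, target):
    target.W = online.W.copy()               # periodic hard update of the target
```

Removing either piece reproduces the classic failure modes: without replay, consecutive correlated transitions drive oscillation; without a target network, the bootstrap target moves with every update and values can diverge.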
-
Transformers vs. RNNs
-
Contrast parallelism, handling of long‑range dependencies, and complexity.
-
For sequence length n = 1024 and model dimension d = 512, estimate asymptotic time and memory costs of self‑attention. Name two techniques that mitigate quadratic scaling.
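The expected estimate follows directly from the shapes: both QKᵀ and the attention-weights-times-V product cost O(n²d), and the score matrix costs O(n²) memory per head. Plugging in the given n and d (counting a multiply-add as two FLOPs, and assuming fp16 storage as an illustration):

```python
n, d = 1024, 512

qkT_flops = 2 * n * n * d              # QKᵀ: ~1.07e9 FLOPs per layer
av_flops = 2 * n * n * d               # attention-weights × V: same cost
score_entries = n * n                  # 1,048,576 scores per head
score_bytes_fp16 = score_entries * 2   # ~2 MiB per head in fp16
```

Mitigations to name include sparse/local attention, low-rank or kernelized approximations (e.g. Linformer, Performer), and IO-aware exact attention (FlashAttention), which avoids materializing the n×n matrix.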
-
Embeddings and polysemy
-
Define embeddings and polysemy.
-
Propose a method to distinguish the word ‘King’ in chess vs. monarchy contexts using contextual encoders or multi‑sense embeddings.
-
Outline one intrinsic evaluation (e.g., word sense disambiguation) and one extrinsic evaluation (e.g., downstream task accuracy).
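One acceptable answer shape: take the contextual vector of the token ‘King’ from each occurrence and assign it to the nearest sense prototype by cosine similarity. The sketch below uses tiny hand-made vectors as stand-ins; in practice they would come from a contextual encoder such as BERT (e.g. the hidden state at the token's position), and the 3-dimensional values here are purely illustrative.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical contextual vectors for 'King'; in practice, encoder outputs.
king_chess = np.array([0.9, 0.1, 0.0])   # prototype from chess sentences
king_crown = np.array([0.1, 0.9, 0.1])   # prototype from monarchy sentences
king_query = np.array([0.8, 0.2, 0.1])   # 'the king is in check'

# Nearest sense prototype disambiguates the query occurrence.
sense = max([("chess", cosine(king_query, king_chess)),
             ("monarchy", cosine(king_query, king_crown))],
            key=lambda t: t[1])[0]
```

The intrinsic evaluation then measures sense-assignment accuracy against labeled WSD data; the extrinsic one swaps the embeddings into a downstream task and compares accuracy.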
-
Low‑compute fine‑tuning plan (7B model, single 24‑GB GPU)
-
Design a low‑compute fine‑tuning approach (e.g., QLoRA or adapters, 4‑bit quantization, gradient checkpointing, mixed precision).
-
Choose a LoRA rank r and specify batch size, sequence length, optimizer, and learning‑rate schedule.
-
Provide a back‑of‑the‑envelope estimate of trainable parameters using hidden size ≈ 4096 and ~32 layers. State assumptions about which projections you adapt.
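A reference back-of-the-envelope, under the stated assumptions plus two of our own (rank r = 16, adapting only the q and v projections, each a square d×d matrix): every adapted matrix adds an A (d×r) and a B (r×d) factor, i.e. 2dr parameters.

```python
d, layers, r = 4096, 32, 16
adapted_per_layer = 2                  # assumption: q_proj and v_proj only
params_per_matrix = 2 * d * r          # A: d×r plus B: r×d
trainable = layers * adapted_per_layer * params_per_matrix
print(trainable)                       # 8,388,608 ≈ 8.4M, ~0.12% of 7B
```

Adapting more projections (k, o, MLP) or raising r scales this count linearly, which is the knob to discuss against the 24-GB memory budget.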