Compare Losses and Explain LoRA
Company: Netflix
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: medium
Interview Round: Technical Screen
## ML Fundamentals: Loss Functions and Low-Rank Adaptation
This is a rapid-fire ML fundamentals screen. You are expected to reason precisely about loss functions and parameter-efficient fine-tuning, not just recite definitions. Answer both parts below with the depth an interviewer would probe for.
### Constraints & Assumptions
- This is a conceptual/whiteboard screen — no coding required for these two parts. Equations and clear reasoning are expected.
- Assume the LoRA discussion targets a standard transformer (attention + MLP blocks).
- "Parameter-efficient" is measured by the count of *trainable* parameters and the optimizer-state memory they require, not raw FLOPs of the forward pass.
### Clarifying Questions to Ask
- Should I compare the two losses in their usual settings (MSE for regression, CE for classification), or do you also want me to discuss what goes wrong when MSE is used *on a classification problem*?
- Do you want full derivations (gradients, parameter counts) on the board, or higher-level intuition for why each method behaves the way it does?
- For Part 2, should I assume a default LoRA placement, or discuss which transformer modules to target?
- How much practical depth do you want — e.g., rank/$\alpha$ tuning, merging, and comparisons to other PEFT methods — versus just the core mechanism?
### Part 1 — Mean Squared Error vs. Cross-Entropy
Compare **mean squared error (MSE)** loss and **cross-entropy (CE)** loss. Address:
- When each is appropriate, and the kind of task each targets.
- The probabilistic assumption that each loss corresponds to (i.e., what likelihood you are implicitly maximizing).
- How their **optimization behavior** differs — in particular, why cross-entropy is preferred over MSE for classification with a sigmoid/softmax output.
```hint Where to start
Each loss is a negative log-likelihood under some assumed output distribution. Ask yourself: if you minimize this loss, which distribution are you implicitly fitting the labels to? One family is continuous; the other is discrete.
```
```hint Optimization behavior
Work out the gradient of the loss with respect to the **pre-activation logit** $z$ for a sigmoid output. Compare $\frac{\partial \,\text{MSE}}{\partial z}$ vs $\frac{\partial \,\text{CE}}{\partial z}$. Look at what happens when the model is *confidently wrong* ($p \to 0$ but $y = 1$) — does the gradient vanish or stay strong?
```
```hint Convexity / surface
Consider whether each loss is convex in the model's weights for a linear classifier, and what composing the sigmoid with a squared error does to that surface. This is part of why MSE-on-sigmoid trains slowly.
```
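The likelihood framing from the first hint can be written out as follows. This is a sketch under assumed notation that does not appear in the prompt: $\hat{y}$ is the model's real-valued prediction, $\sigma^2$ is a fixed noise variance, and $p$ is the predicted probability of the positive class.

$$
-\log \mathcal{N}(y \mid \hat{y}, \sigma^2) = \frac{(y - \hat{y})^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)
$$

$$
-\log \mathrm{Bernoulli}(y \mid p) = -\big[\, y \log p + (1 - y)\log(1 - p) \,\big]
$$

With $\sigma^2$ held fixed, minimizing the first expression over the data is the same as minimizing squared error, up to a constant scale and offset. The second expression is binary cross-entropy itself, and the categorical case generalizes it to $-\sum_k y_k \log p_k$.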
#### What This Part Should Cover
- **Task fit**: correctly pairs each loss with the kind of task and target it suits (continuous regression vs. categorical classification).
- **Likelihood framing**: connects each loss to a maximum-likelihood objective under an assumed output/noise distribution (Gaussian for MSE, Bernoulli/categorical for CE), rather than treating it as an arbitrary distance.
- **Gradient argument**: derives the logit-gradient for both losses on a sigmoid/softmax head and uses it to explain *why* CE is preferred — especially the confidently-wrong / saturation case — rather than just asserting it.
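- **Loss-surface intuition**: notes that CE on a sigmoid/softmax output is convex in the weights of a linear classifier while MSE composed with a sigmoid is not, and explains how the resulting flat regions make MSE-on-sigmoid slow to train.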
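The gradient argument can be checked numerically. The snippet below is a minimal sketch, assuming a single sigmoid output, the MSE written with a conventional factor of one half, and hypothetical helper names (`sigmoid`, `grad_mse_wrt_logit`, `grad_ce_wrt_logit`) chosen for illustration. It evaluates both logit-gradients in the confidently-wrong case.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_mse_wrt_logit(z, y):
    """d/dz of 0.5 * (sigmoid(z) - y)**2, via the chain rule."""
    p = sigmoid(z)
    return (p - y) * p * (1.0 - p)   # extra p * (1 - p) factor from the sigmoid

def grad_ce_wrt_logit(z, y):
    """d/dz of -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(z)."""
    return sigmoid(z) - y            # the p * (1 - p) factor cancels

# Confidently wrong: the label is 1 but the logit is very negative, so p is near 0.
z, y = -10.0, 1.0
g_mse = grad_mse_wrt_logit(z, y)
g_ce = grad_ce_wrt_logit(z, y)
print(f"MSE grad: {g_mse:.2e}   CE grad: {g_ce:.2e}")
```

In this setting the CE gradient has magnitude close to 1, while the MSE gradient is several orders of magnitude smaller because of the $p(1 - p)$ factor. That factor is the saturation effect the rubric refers to.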
### Part 2 — Low-Rank Adaptation (LoRA)
Explain **Low-Rank Adaptation (LoRA)**. Address:
- The core idea and the math: how the weight update is parameterized.
- How it is applied when fine-tuning a large neural network or LLM (which layers, what is frozen vs. trained, what happens at inference).
- Why it is **parameter-efficient**, and roughly how the trainable parameter count compares to full fine-tuning.
```hint Where to start
Don't fine-tune $W$ directly. What if you froze the original weights and only learned the *update* $\Delta W$ — and what structural constraint could you put on $\Delta W$ so that it has far fewer free parameters than $W$ itself?
```
```hint The math
A low-rank matrix can be written as a product of two thin factors. Express the update that way, pick a rank, and reason about how that rank determines the trainable-parameter count relative to the full matrix. Is there a scalar you'd add to keep the effective update magnitude stable as you change the rank?
```
```hint Initialization & inference
At the very start of fine-tuning, you want the adapted model to behave identically to the pretrained one — what does that imply about how the two factors should be initialized? And at inference, ask whether the learned update can be folded back into the original weights so there's no extra latency.
```
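The three hints above point toward the following parameterization. It is written as a sketch with assumed notation: $W_0 \in \mathbb{R}^{d \times k}$ is the frozen pretrained weight, $B$ and $A$ are the two thin trainable factors, $r$ is the rank, and $\alpha$ is the scaling constant.

$$
h = W_0 x + \Delta W\, x = W_0 x + \frac{\alpha}{r}\, B A\, x, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
$$

$$
\#\text{trainable} = r\,(d + k) \quad \text{vs.} \quad d\,k \ \text{for full fine-tuning of } W_0
$$

Under the common convention of initializing $B = 0$ and $A$ randomly, $\Delta W = 0$ at step zero, so the adapted model starts out identical to the pretrained one. After training, $W_0 + \frac{\alpha}{r} B A$ can be stored as a single matrix of the original shape.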
#### What This Part Should Cover
- **Mechanism**: states what is frozen vs. trained and how the update is structured (a product of two thin matrices) to be cheap, rather than merely naming the method.
- **Math & initialization**: writes $\Delta W$ as a scaled low-rank product, explains the rank choice, and gives the zero-init detail that makes the adapter start as a no-op.
- **Efficiency, quantified**: quantifies the trainable-parameter reduction (ideally with a concrete count) and identifies optimizer-state / memory savings as the real win, not forward FLOPs.
- **Inference handling**: addresses how the adapter is treated at inference (merging vs. keeping it separate) and the latency / multi-task serving trade-off.
- **Honest limitations**: names regimes where a low-rank update falls short of full fine-tuning.
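The mechanism, zero-init, and merging points above can be illustrated on a single linear layer. This is a minimal NumPy sketch, not a training implementation. The layer sizes, the `alpha / r` scaling convention, and the helper names `lora_forward` and `merge` are assumptions made for illustration, and a random `B_trained` stands in for the result of fine-tuning.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; real attention projections are far larger.
d_out, d_in, r, alpha = 64, 32, 4, 8.0

W0 = rng.normal(size=(d_out, d_in))          # pretrained weight, kept frozen
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable factor, small random init
B = np.zeros((d_out, r))                     # trainable factor, zero init => delta_W = 0

def lora_forward(x, W0, A, B, alpha, r):
    """y = W0 x + (alpha / r) * B (A x), for x of shape (batch, d_in)."""
    return x @ W0.T + (alpha / r) * (x @ A.T) @ B.T

def merge(W0, A, B, alpha, r):
    """Fold the adapter into one dense weight so inference needs no extra matmul."""
    return W0 + (alpha / r) * (B @ A)

x = rng.normal(size=(5, d_in))

# At initialization the adapter is a no-op because B is all zeros.
starts_as_noop = np.allclose(lora_forward(x, W0, A, B, alpha, r), x @ W0.T)

# Stand-in for a trained B (assumption): merged and unmerged paths should agree.
B_trained = rng.normal(scale=0.01, size=(d_out, r))
W_merged = merge(W0, A, B_trained, alpha, r)
merge_matches = np.allclose(
    x @ W_merged.T, lora_forward(x, W0, A, B_trained, alpha, r)
)

print(starts_as_noop, merge_matches)
```

Keeping `A` and `B` separate from `W0` is what allows several task-specific adapters to share one base model. Merging trades that flexibility for a forward pass with the same cost as the original layer.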
### Follow-up Questions
- Derive $\frac{\partial \mathcal{L}}{\partial z}$ for both losses on a single sigmoid output and explain precisely why MSE's gradient can stall.
- For a $4096 \times 4096$ attention projection with $r = 8$, what fraction of that layer's parameters does LoRA train?
- What do the LoRA hyperparameters $r$, $\alpha$, and dropout control, and how would you tune them?
- How does LoRA compare to other PEFT methods such as adapters or prefix-tuning, and to LoRA variants such as QLoRA, and when would you reach for each?
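The arithmetic behind the parameter-fraction follow-up can be sketched as below. The helper name `lora_trainable_fraction` is hypothetical, and the count assumes a single bias-free weight matrix with one rank-$r$ adapter attached.

```python
def lora_trainable_fraction(d_out: int, d_in: int, r: int) -> float:
    """Fraction of a d_out x d_in layer's parameters trained by a rank-r adapter."""
    full = d_out * d_in          # parameters in the dense weight W
    lora = r * (d_out + d_in)    # parameters in B (d_out x r) plus A (r x d_in)
    return lora / full

fraction = lora_trainable_fraction(4096, 4096, 8)
print(f"{fraction:.4%}")
```

For the $4096 \times 4096$ layer with $r = 8$ this is $65{,}536$ trainable parameters against $16{,}777{,}216$, or $1/256$ of the layer (about 0.39%).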
Quick Answer: This question evaluates understanding of loss functions (mean squared error versus cross-entropy) and parameter-efficient fine-tuning via Low-Rank Adaptation (LoRA), emphasizing probabilistic interpretations, optimization behavior, and low-rank parameterization.