How do I approach Machine Learning interview questions?

Machine Learning questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master machine learning interviews.

What difficulty level is this interview question?

This is a medium difficulty Machine Learning question, commonly asked during Technical Screen rounds at Netflix.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at Netflix during technical interviews.

Compare Losses and Explain LoRA | Netflix Interview Question

Q: Compare Losses and Explain LoRA

This question evaluates understanding of loss functions (mean squared error versus cross-entropy) and parameter-efficient fine-tuning via Low-Rank Adaptation (LoRA), emphasizing probabilistic interpretations, optimization behavior, and low-rank parameterization.

ML Fundamentals: Loss Functions and Low-Rank Adaptation

This is a rapid-fire ML fundamentals screen. You are expected to reason precisely about loss functions and parameter-efficient fine-tuning, not just recite definitions. Answer both parts below with the depth an interviewer would probe for.

Constraints & Assumptions

This is a conceptual/whiteboard screen — no coding required for these two parts. Equations and clear reasoning are expected.
Assume the LoRA discussion targets a standard transformer (attention + MLP blocks).
"Parameter-efficient" is measured by the count of trainable parameters and the optimizer-state memory they require, not raw FLOPs of the forward pass.

Clarifying Questions to Ask

Is the Part 1 comparison meant for regression specifically, or do you also want me to discuss what goes wrong when MSE is used on a classification problem?
Do you want full derivations (gradients, parameter counts) on the board, or higher-level intuition for why each method behaves the way it does?
For Part 2, should I assume a default LoRA placement, or discuss which transformer modules to target?
How much practical depth do you want (e.g., rank/$\alpha$ tuning, merging, and comparisons to other PEFT methods) versus just the core mechanism?

Part 1 — Mean Squared Error vs. Cross-Entropy

Compare mean squared error (MSE) loss and cross-entropy (CE) loss. Address:

When each is appropriate, and the kind of task each targets.
The probabilistic assumption that each loss corresponds to (i.e., what likelihood you are implicitly maximizing).
How their optimization behavior differs — in particular, why cross-entropy is preferred over MSE for classification with a sigmoid/softmax output.

What This Part Should Cover

Task fit: correctly pairs each loss with the kind of task and target it suits (continuous regression vs. categorical classification).
Likelihood framing: connects each loss to a maximum-likelihood objective under an assumed output/noise distribution (Gaussian for MSE, Bernoulli/categorical for CE), rather than treating it as an arbitrary distance.
Gradient argument: derives the gradient with respect to the logit for both losses on a sigmoid/softmax head and uses it to explain why CE is preferred, especially in the confidently-wrong (saturated) case, rather than just asserting it.
Loss-surface intuition: notes the convexity / flat-region difference and why it makes MSE-on-sigmoid slow or unstable.
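
The gradient argument can be checked numerically. The following is a minimal sketch, not part of the original question, assuming a single sigmoid output $p = \sigma(z)$ with label $y$, MSE written as $\frac{1}{2}(p - y)^2$, and binary cross-entropy written as $-[y \log p + (1 - y) \log(1 - p)]$. The function names are illustrative only.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function mapping a logit z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def mse_logit_grad(z: float, y: float) -> float:
    """d/dz of 0.5 * (sigmoid(z) - y)**2, obtained via the chain rule.

    The extra factor p * (1 - p) is the sigmoid derivative; it shrinks
    toward zero whenever the output saturates near 0 or 1.
    """
    p = sigmoid(z)
    return (p - y) * p * (1.0 - p)


def ce_logit_grad(z: float, y: float) -> float:
    """d/dz of binary cross-entropy on sigmoid(z).

    The sigmoid derivative cancels against the 1 / (p * (1 - p)) term
    from the log, leaving just the prediction error p - y.
    """
    return sigmoid(z) - y


# Confidently wrong example (assumed values): true label 1, logit far negative.
z, y = -10.0, 1.0
print("MSE grad wrt logit:", mse_logit_grad(z, y))
print("CE  grad wrt logit:", ce_logit_grad(z, y))
```

In this saturated case the cross-entropy gradient stays close to $-1$, while the MSE gradient is multiplied by $p(1 - p)$ and becomes very small, which is the stalling behavior the gradient argument refers to.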

Part 2 — Low-Rank Adaptation (LoRA)

Explain Low-Rank Adaptation (LoRA). Address:

The core idea and the math: how the weight update is parameterized.
How it is applied when fine-tuning a large neural network or LLM (which layers, what is frozen vs. trained, what happens at inference).
Why it is parameter-efficient, and roughly how the trainable parameter count compares to full fine-tuning.

What This Part Should Cover

Mechanism: states what is frozen vs. trained and how the update is structured (a product of two thin matrices) to be cheap, rather than merely naming the method.
Math & initialization: writes $\Delta W$ as a scaled low-rank product, explains the rank choice, and gives the zero-init detail that makes the adapter start as a no-op.
Efficiency, quantified: quantifies the trainable-parameter reduction (ideally with a concrete count) and identifies optimizer-state / memory savings as the real win, not forward FLOPs.
Inference handling: addresses how the adapter is treated at inference (merging vs. keeping it separate) and the latency / multi-task serving trade-off.
Honest limitations: names regimes where a low-rank update falls short of full fine-tuning.
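
For reference, the parameterization that the points above describe can be written as follows. This is a sketch in the notation commonly used for LoRA, assuming a single frozen weight matrix $W_0 \in \mathbb{R}^{d \times k}$; the symbols $d$, $k$, $A$, $B$, $x$, and $h$ are assumptions introduced here and are not defined in the question itself.

```latex
% Frozen pretrained weight plus a trainable, scaled low-rank update
W = W_0 + \Delta W, \qquad
\Delta W = \frac{\alpha}{r}\, B A, \qquad
B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)

% Forward pass during fine-tuning (only A and B receive gradients)
h = W_0 x + \frac{\alpha}{r}\, B A x

% Initialization: A random (e.g. Gaussian), B = 0, so the adapter starts as a no-op
B = 0 \;\Rightarrow\; \Delta W = 0 \quad \text{at the start of training}

% Trainable parameters for this matrix, compared with full fine-tuning
r\,(d + k) \quad \text{vs.} \quad d\,k
```

After training, $\frac{\alpha}{r} B A$ can be added into $W_0$ so that inference uses a single matrix, or kept separate so that several adapters can share one base model.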

Follow-up Questions

Derive $\frac{\partial \mathcal{L}}{\partial z}$ for both losses on a single sigmoid output and explain precisely why MSE's gradient can stall.
For a $4096 \times 4096$ attention projection with $r = 8$, what fraction of that layer's parameters does LoRA train?
What do the LoRA hyperparameters $r$, $\alpha$, and dropout control, and how would you tune them?
How does LoRA compare to other PEFT methods such as adapters, prefix-tuning, or QLoRA, and when would you reach for each?
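
The arithmetic behind the $4096 \times 4096$, $r = 8$ follow-up can be sketched as below. This is an illustrative helper, not part of the original question; it assumes the adapter consists of one $d_{\text{out}} \times r$ matrix and one $r \times d_{\text{in}}$ matrix and ignores any bias term. The function name `lora_param_counts` is hypothetical.

```python
def lora_param_counts(d_out: int, d_in: int, r: int) -> tuple[int, int, float]:
    """Return (full, lora, fraction) parameter counts for one weight matrix.

    full     : parameters in the frozen d_out x d_in matrix
    lora     : trainable parameters in B (d_out x r) plus A (r x d_in)
    fraction : lora / full
    """
    full = d_out * d_in
    lora = r * (d_out + d_in)
    return full, lora, lora / full


full, lora, frac = lora_param_counts(4096, 4096, 8)
print(f"full={full:,} lora={lora:,} fraction={frac:.4%}")
# prints: full=16,777,216 lora=65,536 fraction=0.3906%
```

The fraction here is exactly $1/256$ of that layer's weights; the share of the whole model that is trainable depends on which modules receive adapters.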
