LoRA Parameter-Efficient Fine-Tuning
Asked of: Machine Learning Engineer
What's being tested
Candidates must show practical mastery of parameter-efficient fine-tuning (how and why adapters like LoRA change training/serving tradeoffs), plus the probabilistic and optimization implications of choosing losses like mean squared error versus cross-entropy. Interviewers are probing whether you can (a) pick the right loss and optimizer given model/metric requirements, (b) design and tune a LoRA-based training run (rank, placement, LR, weight decay, merging strategy), and (c) reason end-to-end about offline evaluation, calibration, and deployment constraints that an MLE owns.
Core knowledge
-
LoRA (Low-Rank Adaptation): adds a trainable low-rank update to a frozen weight matrix W (d×k) using B (d×r) and A (r×k): W' = W + (α/r)·BA. Only A and B are trained. Typical ranks are r ∈ [4, 256]; trainable-parameter savings are commonly 10–100× or more versus full fine-tuning, depending on rank and placement.
-
Where to place LoRA: common targets are the attention projection matrices (Q/K/V/output) and the large dense layers (transformer MLP). For ViT, prefer attention Q/K/V and the MLP FC layers, where parameter count and representational power concentrate.
-
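Loss choice (probabilistic view): cross-entropy for categorical outcomes corresponds to maximizing the log-likelihood of a categorical/softmax model with logits z:
$$\mathcal{L}_{\mathrm{CE}} = -\sum_{i} \log p_\theta(y_i \mid x_i), \qquad p_\theta(y = c \mid x) = \frac{\exp(z_c)}{\sum_{j} \exp(z_j)}$$
MSE corresponds to a Gaussian likelihood with homoscedastic (constant) variance σ²:
$$-\log p_\theta(y \mid x) = \frac{1}{2\sigma^2}\,\bigl(y - f_\theta(x)\bigr)^2 + \text{const}$$
Use CE for classification and when calibration matters; use MSE for regression where squared error is the evaluation metric.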
-
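Gradient & optimization effects: with softmax outputs, the cross-entropy gradient with respect to the logits is p − y, so it stays large for confidently misclassified examples instead of plateauing. The MSE gradient grows linearly with the residual, so outliers can dominate it, and when MSE is applied on top of a softmax the gradient shrinks as the softmax saturates. This matters for LoRA because the low-rank update has limited capacity and must carry whatever corrective signal the loss provides.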
-
Optimizer choice:
AdamW is usually preferred for adapter-style fine-tuning: fast convergence, stable with small parameter subsets, and decoupled weight decay. SGD with momentum can give better generalization in some full fine-tuning regimes but is slower and requires LR schedules (warmup + cosine/step).
-
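Hyperparameters specific to LoRA: initialize one factor to zero (in the original paper, B is zero and A is random Gaussian) so that BA = 0 at the start and training begins from the base model's behavior; apply the α/r scaling; choose an LR higher than a typical full fine-tuning LR because only a few parameters update; weight decay on the LoRA matrices is often disabled or reduced.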
-
Checkpointing & serving: persist LoRA-only checkpoints (A, B + metadata) instead of the full model. For inference, either (a) merge the adapters into the base weights (W ← W + (α/r)·BA) for single-model inference, or (b) load adapters at runtime for multi-experiment serving, which avoids duplicating base-model memory.
-
Memory/compute tradeoffs: freezing the base model reduces training memory, but activation memory remains. LoRA shrinks the trainable parameters, gradients, and optimizer state for each adapted matrix from O(d·k) to O(r·(d+k)), but activations must still be stored for backprop through the frozen layers; gradient checkpointing reduces that cost at the price of recomputation.
-
Evaluation signals MLEs own: track offline accuracy/AUC and calibration (ECE), latency (CPU/GPU), throughput, and model-size per-replica. For adapters, add deployment metrics: merge-time cost, adapter load time, and rollout A/B metrics for personalization.
-
Failure modes: a low rank can be insufficient for large domain shifts; LoRA can underfit if r is too small or if it is placed on the wrong layers. Catastrophic forgetting is less of a problem because the base weights are frozen and the adapter can be removed, but miscalibrated outputs and distribution shift remain.
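The LoRA update, zero initialization, and merge step from the list above can be sketched in a few lines. This is a dependency-free illustration, not a production layer: matrices are plain lists of rows, and the names `LoRALinear`, `matmul`, `add_scaled`, and `lora_param_count` are hypothetical helpers introduced here. It assumes W is d×k, B is d×r, and A is r×k, with B initialized to zero.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add_scaled(W, D, scale):
    """Return W + scale * D, elementwise."""
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, D)]

class LoRALinear:
    """Frozen weight W (d x k) plus a trainable update (alpha / r) * B @ A."""

    def __init__(self, W, r, alpha, seed=0):
        rng = random.Random(seed)
        d, k = len(W), len(W[0])
        self.W = W                  # frozen base weight, never updated
        self.scale = alpha / r
        # B starts at zero, so the adapted layer initially equals the base layer.
        self.B = [[0.0] * r for _ in range(d)]            # d x r
        # A gets small random values (illustrative std of 0.01).
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(k)]
                  for _ in range(r)]                      # r x k

    def forward(self, x):
        """x is a column vector (k x 1); returns W x + scale * B (A x)."""
        base = matmul(self.W, x)
        low_rank = matmul(self.B, matmul(self.A, x))
        return add_scaled(base, low_rank, self.scale)

    def merged_weight(self):
        """Fold the adapter into a single d x k matrix for serving."""
        return add_scaled(self.W, matmul(self.B, self.A), self.scale)

def lora_param_count(d, k, r):
    """Trainable parameters for one adapted d x k matrix."""
    return r * (d + k)

# Usage: at initialization the adapter is a no-op because B is zero.
layer = LoRALinear([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], r=1, alpha=2.0)
x = [[1.0], [0.0], [-1.0]]
print(layer.forward(x) == matmul(layer.W, x))
# A 4096 x 4096 projection at r = 8 trains r * (d + k) values instead of d * k.
print(lora_param_count(4096, 4096, 8), 4096 * 4096)
```

Serving the merged weight and serving the unmerged pair compute the same output for a linear layer, which is why merging is a deployment choice rather than a modeling one.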
Tip: When tuning, run ablations over (rank r, α scaling, LR multiplier for adapters, and placement) on a held-out slice that matters for production (e.g., long-tail content types).
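One way to organize the ablation in the tip is to enumerate a small grid of configurations up front. The sketch below only builds the grid; the search-space values and the placement labels (`"qv"`, `"qkvo"`, `"qkvo+mlp"`) are hypothetical placeholders, not recommendations, and training and evaluation on the held-out slice are left out.

```python
from itertools import product

# Hypothetical search space; the values are illustrative only.
SEARCH_SPACE = {
    "rank": [4, 16, 64],
    "alpha_over_rank": [1.0, 2.0],
    "lr_multiplier": [5.0, 10.0],
    "placement": ["qv", "qkvo", "qkvo+mlp"],
}

def ablation_grid(space):
    """Yield one config dict per combination in the search space."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        config = dict(zip(keys, values))
        # Sweep the effective scale alpha / r, then derive alpha from it,
        # so changing the rank does not silently change the update scale.
        config["alpha"] = config["alpha_over_rank"] * config["rank"]
        yield config

configs = list(ablation_grid(SEARCH_SPACE))
print(len(configs))  # 3 * 2 * 2 * 3 = 36 runs
```

Sweeping α/r rather than α directly is a design choice in this sketch: it keeps the magnitude of the update comparable across ranks.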
Worked example — Compare Losses and Explain LoRA
First 30s: clarify the target metric (classification vs regression), whether the base model is frozen, and the deployment constraints (adapter storage vs merged inference). Frame the response around three pillars: (1) the probabilistic interpretation and gradient behavior of cross-entropy vs MSE, (2) how LoRA mechanics interact with the chosen loss and optimizer, and (3) practical hyperparameter and operational tradeoffs for training and serving. Explain that CE maximizes the categorical log-likelihood and keeps gradients large on confidently wrong predictions, while MSE assumes Gaussian noise and is sensitive to outliers; mention calibration and when you’d use each (CE for label prediction, MSE for numeric targets). For LoRA, describe insertion into the attention projection matrices, typical rank choices, α scaling, and how training only the low-rank matrices changes optimizer choices: prefer AdamW with a relatively larger LR and weight decay disabled on adapter params. Flag a specific tradeoff: unmerged adapters let you host one base model with many adapter variants (memory-efficient for experiments) but add runtime complexity and slight loading latency; merging simplifies serving but duplicates model weights per variant. Close by proposing, if time allows, an ablation plan (rank sweep, placement sweep, LR schedule) and the metrics to monitor (ECE, per-slice recall, adapter load latency).
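The gradient argument in this answer can be made concrete with a small numeric sketch. It assumes a three-class softmax classifier and compares the gradient with respect to the logits under cross-entropy and under MSE applied to the softmax probabilities (defined here as 0.5 times the squared error). The function names and the example logits are illustrative.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def ce_grad(z, y):
    """Gradient of cross-entropy w.r.t. logits: p - y."""
    p = softmax(z)
    return [pi - yi for pi, yi in zip(p, y)]

def mse_grad(z, y):
    """Gradient of 0.5 * sum((p - y)^2) w.r.t. logits.

    Applies the softmax Jacobian J[i][j] = p[i] * (delta_ij - p[j])
    to the error vector (p - y).
    """
    p = softmax(z)
    err = [pi - yi for pi, yi in zip(p, y)]
    dot = sum(e * pi for e, pi in zip(err, p))
    return [pj * (ej - dot) for pj, ej in zip(p, err)]

logits = [10.0, 0.0, 0.0]   # model is confident in class 0
target = [0.0, 1.0, 0.0]    # the true class is 1

print("CE  max |grad|:", max(abs(g) for g in ce_grad(logits, target)))
print("MSE max |grad|:", max(abs(g) for g in mse_grad(logits, target)))
```

On this confidently wrong example the cross-entropy gradient stays close to 1 in magnitude, while the MSE gradient is several orders of magnitude smaller because the saturated softmax scales it down.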
A second angle — Explain self-attention, LoRA, Adam vs SGD, ViT
Here the focus shifts toward architecture and optimizer interactions. Start by describing self-attention as computing softmax(QK^T/√d_k) and using the result to weight the values; that clarifies why LoRA on the Q/K/V projections can change attention patterns even with frozen base weights. For ViT, patch embeddings and class-token behavior make the attention projections and MLP blocks prime LoRA targets; adapting Q/K can reweight patch interactions without full retraining. Discuss optimizer dynamics: AdamW's per-parameter adaptive step sizes cope well with differently scaled adapter gradients and converge rapidly for small parameter subsets; SGD may require longer schedules and can yield better generalization, but is rarely used for adapter-only training in production due to iteration cost. Conclude by highlighting deployment implications: adapter placement determines how many layers carry extra inference FLOPs when adapters are served unmerged, and you should verify that merged weights reproduce the unmerged outputs (merging is exact for linear layers up to numerical precision).
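To illustrate why a low-rank update on the query projection alone can shift attention, here is a toy sketch of the attention weights softmax(QK^T/√d_k) with and without a rank-1 adapter on W_q. It uses a row-vector convention (Q = X·W_q), two tokens, and identity projections; all values and helper names are assumptions chosen to keep the arithmetic easy to follow.

```python
import math

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def softmax_rows(S):
    """Apply a numerically stable softmax to each row of S."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention_weights(X, Wq, Wk):
    """softmax(Q K^T / sqrt(d_k)) for tokens X (n x d), with Q = X Wq, K = X Wk."""
    Q, K = matmul(X, Wq), matmul(X, Wk)
    d_k = len(Wk[0])
    scores = matmul(Q, transpose(K))
    return softmax_rows([[s / math.sqrt(d_k) for s in row] for row in scores])

def with_lora(W, B, A, alpha, r):
    """Return W + (alpha / r) * B @ A without modifying W."""
    delta = matmul(B, A)
    return [[w + (alpha / r) * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

X = [[1.0, 0.0], [0.0, 1.0]]          # two toy tokens
identity = [[1.0, 0.0], [0.0, 1.0]]   # frozen W_q and W_k

base = attention_weights(X, identity, identity)
# Rank-1 adapter on W_q only; W_k stays frozen and untouched.
Wq_adapted = with_lora(identity, B=[[0.0], [1.0]], A=[[2.0, 0.0]], alpha=1.0, r=1)
adapted = attention_weights(X, Wq_adapted, identity)

print("base   :", base)
print("adapted:", adapted)
```

In this toy setup the second token attends mostly to itself before adaptation and mostly to the first token afterwards, while the first token's attention row is unchanged.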
Common pitfalls
Pitfall: Treating loss choice as purely empirical.
Choosing MSE for a multiclass classification task because it “works sometimes” ignores calibration and learning dynamics; explain the likelihood interpretation and prefer cross-entropy unless the problem truly is regression.
Pitfall: Forgetting optimizer/weight-decay interactions.
Tuning LoRA with default global weight decay can decay tiny adapter weights into ineffectiveness; state that you’ll disable/adjust weight decay for adapter params and use a higher LR multiplier.
Pitfall: Ignoring inference merging costs.
Saying “we’ll just use adapters” without discussing merge vs dynamic loading misses serving realities—detail memory vs latency tradeoffs and checkpointing strategy.
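For the weight-decay pitfall above, a common remedy is to give adapter parameters their own optimizer group. The sketch below is framework-agnostic: it returns a list of dicts with `params`, `lr`, and `weight_decay` keys, which is the per-parameter-group shape that optimizers such as `torch.optim.AdamW` accept. The `"lora_"` name filter, the helper name `build_param_groups`, and the default multiplier are assumptions for illustration.

```python
def build_param_groups(named_params, base_lr, adapter_lr_multiplier=10.0,
                       weight_decay=0.01):
    """Split (name, param) pairs into adapter and non-adapter groups.

    Adapter params get a higher LR and no weight decay; everything else
    that is trainable (e.g. a task head) keeps the base LR and decay.
    """
    adapter, other = [], []
    for name, param in named_params:
        # Assumption: adapter tensors carry "lora_" somewhere in their name.
        (adapter if "lora_" in name else other).append(param)

    groups = []
    if adapter:
        groups.append({
            "params": adapter,
            "lr": base_lr * adapter_lr_multiplier,
            "weight_decay": 0.0,
        })
    if other:
        groups.append({
            "params": other,
            "lr": base_lr,
            "weight_decay": weight_decay,
        })
    return groups

# Usage with placeholder strings standing in for real parameter tensors.
named = [
    ("layers.0.attn.q_proj.lora_A", "pA"),
    ("layers.0.attn.q_proj.lora_B", "pB"),
    ("classifier.weight", "pC"),
]
for group in build_param_groups(named, base_lr=1e-4):
    print(group)
```

With a real model, the `(name, param)` pairs would come from the trainable parameters only, since frozen base weights do not belong in the optimizer.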
Connections
Adapters and LoRA live in the broader family of parameter-efficient methods such as prefix tuning, prompt tuning, and BitFit, and the interviewer may pivot to comparing these. They may also move toward quantization or distillation to reduce inference cost after fine-tuning, or to optimizer/regularization topics like learning-rate schedules and mixed-precision training.
Further reading
-
LoRA: Low-Rank Adaptation of Large Language Models — Hu et al., 2021 — original paper describing math, rank/α choices, and experiments.
-
Attention Is All You Need — Vaswani et al., 2017 — for self-attention mechanics and where adapters intervene.
-
Decoupled Weight Decay Regularization (AdamW) — Loshchilov & Hutter — introduces decoupled weight decay, the property that makes AdamW a common default for adapter-style fine-tuning.
Related concepts
- RLHF And Preference Optimization Basics
- Low-Latency/Batch Inference and GPU Resource Management
- Transformer Architecture And LLM Lifecycle (Machine Learning)
- Distributed Training And LLM Fine-Tuning Platforms (ML System Design)
- ML Fundamentals: Backprop, Attention, And RL (Machine Learning)
- Real-Time Edge Inference Optimization (ML System Design)