Design a Double Descent Experiment
Company: Anthropic
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: medium
Interview Round: HR Screen
You are given a take-home assignment for a mechanistic interpretability / machine learning interview.
**Design an experiment that clearly demonstrates *sample-wise double descent* — double descent in the test error as a function of the sample-to-feature ratio — in a supervised learning problem.** The work should be completable in roughly four hours and summarized in a short slide deck.
Your deliverable must:
1. **Choose a concrete learning setup and data-generating process** (model class, dataset, noise model) that can exhibit the effect cleanly and cheaply.
2. **Sweep the number of training samples $n$ relative to the feature dimension $d$** so that the test-error curve shows the characteristic dip → spike → second descent, with the peak located at the interpolation threshold ($n \approx d$).
3. **Provide a theoretical explanation** for *why* the double-descent behavior appears.
4. **Propose at least one mitigation** for the phenomenon and explain *why* it should help.
5. **Present** the setup, plots, theory, and mitigation clearly in slides.
A simple high-dimensional linear regression setting is acceptable if it cleanly exhibits the effect — you are graded on the clarity of the experiment, the correctness of the explanation, and the soundness of the mitigation, not on using an exotic model.
### Constraints & Assumptions
- **Time budget:** ~4 hours total, including writing the slides. The experiment must run on a single laptop/CPU in minutes, not hours.
- **Scope of "double descent":** specifically the *sample-wise* (a.k.a. *aspect-ratio*) curve, i.e. test error vs. $n/d$ (or $d/n$) with the feature dimension $d$, and hence model capacity, held fixed. This is **not** the model-size or epoch-wise variant. Be explicit about which axis you sweep.
- **What is measured:** generalization error (e.g. test MSE) on a large held-out set, averaged over multiple random seeds.
- You may use synthetic data; a known ground-truth signal makes the bias/variance story exact.
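One way the constraints above could be met is sketched below. It assumes isotropic Gaussian features, a linear ground truth with $\mathbb{E}\|\beta\|^2 = 1$, Gaussian label noise, and the minimum-norm least-squares estimator computed with `np.linalg.pinv`. The function names, `d = 50`, `sigma = 0.5`, and the grid of $n$ values are illustrative choices, not part of the assignment.

```python
import numpy as np

def min_norm_test_mse(n, d=50, sigma=0.5, n_test=2000, seed=0):
    """Held-out MSE of minimum-norm least squares for one random draw."""
    rng = np.random.default_rng(seed)
    beta = rng.standard_normal(d) / np.sqrt(d)       # assumed ground truth, E||beta||^2 = 1
    X = rng.standard_normal((n, d))                  # assumed isotropic Gaussian features
    y = X @ beta + sigma * rng.standard_normal(n)    # assumed additive Gaussian label noise
    # pinv gives ordinary least squares when n >= d and the
    # minimum-l2-norm interpolating solution when n < d.
    beta_hat = np.linalg.pinv(X) @ y
    X_test = rng.standard_normal((n_test, d))        # independent held-out set
    y_test = X_test @ beta + sigma * rng.standard_normal(n_test)
    return float(np.mean((X_test @ beta_hat - y_test) ** 2))

def sweep(ns, d=50, n_seeds=20, sigma=0.5):
    """Average the held-out MSE over seeds for each training-set size n."""
    return {
        n: float(np.mean([min_norm_test_mse(n, d, sigma, seed=s)
                          for s in range(n_seeds)]))
        for n in ns
    }

# Sweep n at fixed d = 50, so n/d runs from 0.2 to 8.
curve = sweep([10, 25, 50, 100, 400])
for n, mse in curve.items():
    print(f"n/d = {n / 50:4.1f}   mean test MSE = {mse:.3f}")
```

A real run would use a denser grid around $n = d$ and plot the result on a log-scaled error axis. The mean at $n = d$ is heavy-tailed, so reporting the median or error bars alongside it is a reasonable design choice.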
```hint Where to start
You need a setting where you can push past the interpolation threshold ($n < d$) and where the model *exactly fits* the training data there. Ordinary least squares with $n < d$ has infinitely many zero-training-error solutions — which one does the estimator pick, and what makes that choice well-defined?
```
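The hint's question about which solution gets picked can be checked numerically. The sketch below assumes the estimator is the Moore-Penrose pseudo-inverse solution and compares it with another interpolating solution built by adding a null-space component. The sizes and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                        # n < d: underdetermined system
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

X_pinv = np.linalg.pinv(X)
beta_min = X_pinv @ y                # pseudo-inverse (minimum-l2-norm) solution

# I - pinv(X) X projects onto the null space of X, so adding any vector
# from that subspace changes the coefficients but not the fitted values.
null_projector = np.eye(d) - X_pinv @ X
beta_other = beta_min + null_projector @ rng.standard_normal(d)

fits_min = np.allclose(X @ beta_min, y)
fits_other = np.allclose(X @ beta_other, y)
norm_min = np.linalg.norm(beta_min)
norm_other = np.linalg.norm(beta_other)
print(fits_min, fits_other, norm_min, norm_other)
```

Both vectors fit the training data, but the pseudo-inverse solution has the smaller $\ell_2$ norm, which is what makes the estimator well-defined when $n < d$.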
```hint The estimator at the threshold
The spike at $n \approx d$ is not about model capacity per se — it's about *conditioning*. Think about the singular values of the design matrix $X$ and what happens to $(X^\top X)^{-1}$ (or the pseudo-inverse) as $X$ becomes nearly square.
```
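The conditioning argument in this hint can be made concrete by tracking the smallest nonzero singular value of $X$ as $n$ crosses $d$. This is a minimal sketch assuming isotropic Gaussian features with `d = 50`. The helper name and the grid of $n$ values are hypothetical.

```python
import numpy as np

def smallest_singular_value(n, d=50, seed=0):
    """Smallest of the min(n, d) singular values of a Gaussian n x d matrix."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    return float(np.linalg.svd(X, compute_uv=False).min())

# Median over 20 draws for each n, keyed by n.
median_sv = {
    n: float(np.median([smallest_singular_value(n, seed=s) for s in range(20)]))
    for n in (10, 40, 50, 60, 200)
}
for n, s in median_sv.items():
    print(f"n = {n:3d}   median smallest singular value = {s:.3f}")
```

The pseudo-inverse scales the noise in each direction by the reciprocal of the corresponding singular value, so the near-zero value at $n \approx d$ is what amplifies label noise into a test-error spike.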
```hint Bias–variance and the second descent
Decompose risk into bias + variance + irreducible noise and ask which term explodes near the peak. In the regime past the interpolation threshold ($n < d$, more features than samples), what implicit constraint selects the solution among the infinitely many that fit the training data, and why might that *lower* variance again as $d/n$ grows?
```
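For reference, one commonly quoted form of this decomposition for minimum-norm least squares is given below. It assumes isotropic Gaussian features, a well-specified linear model with $\|\beta\|^2 = r^2$, noise variance $\sigma^2$, and the proportional limit $n, d \to \infty$ with $d/n \to \gamma$. The symbols $r$, $\sigma$, and $\gamma$ are introduced here for illustration and do not appear in the prompt.

$$
R(\gamma) - \sigma^2 =
\begin{cases}
\underbrace{0}_{\text{bias}} + \underbrace{\sigma^2 \dfrac{\gamma}{1-\gamma}}_{\text{variance}}, & \gamma < 1 \quad (n > d), \\[2ex]
\underbrace{r^2\left(1 - \dfrac{1}{\gamma}\right)}_{\text{bias}} + \underbrace{\sigma^2 \dfrac{1}{\gamma-1}}_{\text{variance}}, & \gamma > 1 \quad (n < d).
\end{cases}
$$

Under these assumptions the variance term diverges as $\gamma \to 1$ from either side, which is the peak. For $\gamma > 1$ the variance shrinks as $\gamma$ grows while the bias rises toward $r^2$.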
```hint Making the curve reproducible
A single seed gives a noisy curve and the peak may not align with $n=d$. Average test error over many seeds per $n$, keep a fixed signal-to-noise ratio, and use a large independent test set so the curve is smooth enough to read.
```
### Clarifying Questions to Ask
- Which flavor of double descent is in scope — sample-wise (vary $n$), model-wise (vary $d$ or width), or epoch-wise? Should the deck address only one?
- Is synthetic data acceptable, or must this be shown on a real dataset?
- Is a closed-form / analytical estimator (e.g. minimum-norm least squares) acceptable, or is a trained neural network expected?
- How rigorous should the theory be — qualitative intuition, a bias–variance argument, or a formal asymptotic (random-matrix) result?
- What is the audience for the slides — ML researchers who want the math, or a broader review panel?
- Is a single mitigation sufficient, or should several be compared?
### What a Strong Answer Covers
- **A reproducible experiment** with a clearly stated data-generating process, a precisely specified estimator, and a swept quantity ($n/d$) whose test-error peak sits at the interpolation threshold ($n \approx d$).
- **Correct identification of the mechanism:** the peak is driven by *variance* blowing up due to ill-conditioning at $n \approx d$, not by model capacity alone.
- **An honest theoretical account** — at minimum a bias/variance decomposition; ideally connecting the variance spike to the conditioning of the design matrix (and, for bonus depth, to a high-dimensional spectral argument).
- **A principled mitigation** with a *mechanistic* justification — one that names *which* part of the failure mode it targets (e.g. the amplified directions, the noise level, or the aspect ratio itself) and explains *why* that helps, rather than just "it usually helps." Bonus depth for distinguishing a fixed-strength fix from an optimally-tuned one.
- **Clean, honest plots:** error bars / multiple seeds, log scale where appropriate, the threshold marked, and the mitigated curve overlaid on the unmitigated one.
- **Awareness of failure modes and limitations:** what would make the curve *not* appear, and what the synthetic result does and does not say about deep nets.
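As an illustration of a mitigation that targets the amplified directions, the sketch below compares ridge regression with the ridgeless minimum-norm fit at the threshold $n = d$. It assumes the same isotropic Gaussian setup, for which the expected test MSE is $\|\hat\beta - \beta\|^2 + \sigma^2$ in closed form. The fixed `lam = 5.0` is an arbitrary, untuned choice, and all names are hypothetical.

```python
import numpy as np

def risk_pair(n, d=50, sigma=0.5, lam=5.0, seed=0):
    """Return (ridgeless risk, ridge risk) for one random draw."""
    rng = np.random.default_rng(seed)
    beta = rng.standard_normal(d) / np.sqrt(d)
    X = rng.standard_normal((n, d))
    y = X @ beta + sigma * rng.standard_normal(n)
    b_min_norm = np.linalg.pinv(X) @ y
    # Ridge replaces 1/s by s / (s^2 + lam) for each singular value s,
    # which bounds the amplification in near-singular directions.
    b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    def risk(b):
        # Closed-form test MSE, valid for isotropic test features (assumed).
        return float(np.sum((b - beta) ** 2) + sigma ** 2)

    return risk(b_min_norm), risk(b_ridge)

# Evaluate at the interpolation threshold n = d = 50 over 20 seeds.
pairs = np.array([risk_pair(n=50, seed=s) for s in range(20)])
ridgeless_mean, ridge_mean = pairs.mean(axis=0)
print(f"ridgeless: {ridgeless_mean:.2f}   ridge (lam=5): {ridge_mean:.2f}")
```

In a deck, the same comparison would be repeated across the whole $n/d$ grid so that the mitigated curve can be overlaid on the unmitigated one.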
### Follow-up Questions
- You used minimum-norm least squares. What *exactly* is being minimized in the $n < d$ regime, and why does the choice of norm matter for whether you see the second descent?
- How do the height and location of the peak change as you vary the signal-to-noise ratio? What happens in the noiseless limit?
- With ridge, there is an *optimal* $\lambda$ at each $n/d$. If you tune $\lambda$ optimally per point, does double descent disappear entirely? What does that tell you about whether double descent is a fundamental phenomenon or an artifact of under-regularization?
- How would this picture change for a 2-layer neural network or a kernel/random-features model, and which axis would you sweep to see double descent there?
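The optimal-$\lambda$ follow-up can be probed empirically. The sketch below assumes isotropic Gaussian features and uses the oracle value $\lambda = d\sigma^2 / \|\beta\|^2$, which is the choice usually reported as optimal for this setting. Treat that formula as an assumption here. In practice $\lambda$ would be tuned by cross-validation, since $\beta$ and $\sigma$ are unknown. Names and sizes are illustrative.

```python
import numpy as np

def tuned_ridge_risk(n, d=50, sigma=0.5, seed=0):
    """Expected test MSE of ridge with an oracle-tuned penalty, one draw."""
    rng = np.random.default_rng(seed)
    beta = rng.standard_normal(d) / np.sqrt(d)
    X = rng.standard_normal((n, d))
    y = X @ beta + sigma * rng.standard_normal(n)
    # Oracle penalty (assumed optimal for isotropic Gaussian features);
    # it uses the true beta and sigma, which a real experiment cannot see.
    lam = d * sigma ** 2 / np.sum(beta ** 2)
    b = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    # Closed-form test MSE for isotropic test features.
    return float(np.sum((b - beta) ** 2) + sigma ** 2)

ns = [10, 25, 50, 100, 400]
risks = [
    float(np.mean([tuned_ridge_risk(n, seed=s) for s in range(30)]))
    for n in ns
]
for n, r in zip(ns, risks):
    print(f"n/d = {n / 50:4.1f}   tuned-ridge risk = {r:.3f}")
```

If the averaged risks decrease steadily in $n$ with no bump at $n = d$, that supports reading the peak as a symptom of under-regularization in this setting rather than as an unavoidable property of the data.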
Quick Answer: This question evaluates understanding of sample-wise double descent, experimental design for reproducible supervised-learning studies, and the underlying theory (generalization, bias–variance decomposition, and matrix conditioning). It requires both practical application (designing and running a short experiment) and conceptual understanding (explaining the underlying causes). It is commonly asked because it probes the ability to produce clear empirical evidence of non-monotonic test error as a function of the sample-to-feature ratio and to connect that evidence to a sound theoretical explanation, skills that matter for mechanistic interpretability, robust evaluation, and statistical reasoning.