Implement 1NN with NumPy
Company: OpenAI
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: medium
Interview Round: Technical Screen
Implement a **1-nearest-neighbor (1NN) classifier from scratch using NumPy**, then show that the same decision can be expressed as a neural-network-style computation with fixed weights and biases.
You are given:
- `X_train`: a NumPy array of shape `(n_train, d)` containing training feature vectors.
- `y_train`: a NumPy array of shape `(n_train,)` containing the corresponding labels.
- `X_test`: a NumPy array of shape `(n_test, d)` containing query (test) feature vectors.
The classifier predicts each test example's label as the label of its single closest training example under **squared Euclidean distance**.
### Constraints & Assumptions
- Use **squared Euclidean distance** $\lVert x - t_i \rVert^2$ as the proximity metric (no need to take the square root).
- The implementation must be **vectorized** — no Python-level loops over individual test or training examples.
- **Tie-break rule:** if two or more training examples are equidistant nearest neighbors of a query, return the label of the **earliest** such training example (smallest index in `X_train`).
- Assume `X_train` and `X_test` share the same feature dimension `d`, and `y_train` aligns with `X_train` row-for-row.
### Clarifying Questions to Ask
- Can I assume the inputs are already NumPy arrays of consistent dtype, or should I coerce/validate shapes and dtypes inside the function?
- How large can `n_train`, `n_test`, and `d` get? Does the full `(n_test, n_train)` distance matrix fit in memory, or should I plan to batch the test set?
- Is plain (squared) Euclidean distance the intended metric, or do you also want me to discuss cosine / other metrics?
- For Part 2, do you want literal Keras/PyTorch layers, or just the explicit weight matrix and bias vector plus the readout rule?
- Should I handle degenerate inputs (empty train or test set, duplicate points, NaNs), or can I assume well-formed data?
---
### Part 1 — Vectorized NumPy implementation
Write a vectorized function that takes `X_train`, `y_train`, `X_test` and returns a length-`n_test` array of predicted labels, where each prediction is the label of the nearest training example under squared Euclidean distance. Respect the earliest-index tie-break rule above, and use no Python-level loops over examples.
```hint Where to start
You need the full $(n_{\text{test}}, n_{\text{train}})$ matrix of pairwise squared distances, then an `argmin` along the training axis. Think about how to get every pairwise distance with array operations instead of nested loops.
```
```hint Distance trick
Materializing the difference for every test-train pair forces a 3-D intermediate. Try expanding $\lVert x - t \rVert^2$ algebraically instead — which of the resulting terms is a genuine pairwise interaction, and which depend on only one side? Once you separate them, ask which can be precomputed per row or per column and reused via broadcasting.
```
```hint Tie-break
Look up exactly how `np.argmin` resolves the situation where several entries along the axis tie for the minimum. Then think about the order in which the training examples appear along that axis, and whether the rule you're asked for falls out of that behavior or needs handling on top.
```
#### What This Part Should Cover
- A fully **vectorized** distance computation with **no per-example Python loops**, ideally via the expanded-norm matmul identity that maps onto BLAS.
- An explicit argument for **why `argmin` along the training axis satisfies the earliest-index tie-break** (first-occurrence semantics + columns in training order).
- A correct **time/memory complexity** characterization and the recognition that the full `(n_test, n_train)` score matrix is the memory bottleneck (and a batching strategy when it doesn't fit).
- **Numerical-stability awareness**: the expanded form can cancel catastrophically (tiny negative "distances" for coincident points, flipped near-ties) versus the direct subtract-and-square form — and when each is appropriate.
### Part 2 — As a neural-network computation with fixed weights
Show how the same nearest-neighbor decision can be represented as a neural-network-style computation with **fixed** weights and biases derived directly from `X_train` (no training). Derive the explicit weight matrix and bias vector, and state the readout rule that recovers the predicted label.
```hint Where to start
A 1NN decision is an $\arg\min$ over training examples. Can you turn that into an $\arg\max$ of an expression that is *affine in the query* $x$? An affine map followed by a winner-take-all is exactly one dense layer.
```
```hint Drop a constant
Write out the squared-distance score for a single training point and ask which terms actually depend on *which* training point you're comparing against. Any term that is the same for every candidate can't break the ranking — so it can be discarded before you try to read off a linear form. What's left, and what sign convention turns the "nearest" comparison into a "largest score" one?
```
```hint Read off W and b
Once the score is purely linear in the query $x$ plus a per-training-point constant, line it up against the shape of a dense layer's output, $xW^\top + b$. The query-dependent part tells you what each row of $W$ must be (in terms of a training vector); the constant part tells you the per-unit bias. The predicted label then comes from a fixed lookup keyed on the winning unit's index.
```
#### What This Part Should Cover
- A correct **algebraic reduction** from $\arg\min$ of $\lVert x - t_i\rVert^2$ to $\arg\max$ of an expression affine in $x$, with the dropped $\lVert x\rVert^2$ term justified (constant across candidates).
- The **explicit weight matrix and bias vector** ($W$ as a function of $X_{\text{train}}$, $b$ as a function of the per-point norms), with correct shapes.
- A clear **readout rule** mapping the winning unit's index back to a class label (e.g. a fixed one-hot label matrix as a second layer), framing the whole thing as a feed-forward template-matching net whose weights *are* the memorized data.
- Acknowledgment that this layer is **untrained / hand-set**, plus the parity (or floating-point caveat) between this `argmax` formulation and the Part 1 `argmin`.
### What a Strong Answer Covers
These dimensions span both parts:
- **Equivalence and parity:** the candidate demonstrates that the Part 1 distance-`argmin` and the Part 2 logit-`argmax` produce the same predictions (and can articulate the small floating-point caveat — Part 2 skips the $\lVert x\rVert^2$ subtraction, so rounding differs slightly).
- **Edge-case handling that applies throughout:** shape/dimension mismatches, empty train/test inputs, duplicate/coincident points, and how NaNs would propagate.
- **Scaling and metric sensitivity:** feature-scale sensitivity of Euclidean 1NN, and how the matmul expansion specializes for cosine (L2-normalize, then a pure dot-product `argmax`).
### Follow-up Questions
- How would you generalize this to $k$NN with a majority vote? What changes in both the NumPy code and the neural-network view, and what new tie-break ambiguities appear?
- The neural-network readout uses a hard $\arg\max$. How would you make it differentiable, and what does the temperature parameter control as it goes to zero?
- The matmul-based distance expansion is numerically unstable in some cases. Where does the cancellation come from, and how would you detect or mitigate it?
- `n_train` is now in the tens of millions. Brute force no longer fits — what indexing or approximation would you reach for, and how does the right choice depend on `d`?
Quick Answer: This question evaluates implementing a 1-nearest-neighbor classifier with NumPy, testing skills in vectorized numerical computation, distance metrics, tie-breaking label handling, and translating algorithms into linear-algebra operations.