How do I approach Machine Learning interview questions?

Machine Learning questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master machine learning interviews.

What difficulty level is this interview question?

This is a medium difficulty Machine Learning question, commonly asked during Technical Screen rounds at OpenAI.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at OpenAI during technical interviews.

Implement 1NN with NumPy | OpenAI Interview Question

Implement 1NN with NumPy

Company: OpenAI

Role: Machine Learning Engineer

Category: Machine Learning

Difficulty: medium

Interview Round: Technical Screen

Implement a **1-nearest-neighbor (1NN) classifier from scratch using NumPy**, then show that the same decision can be expressed as a neural-network-style computation with fixed weights and biases. You are given: - `X_train`: a NumPy array of shape `(n_train, d)` containing training feature vectors. - `y_train`: a NumPy array of shape `(n_train,)` containing the corresponding labels. - `X_test`: a NumPy array of shape `(n_test, d)` containing query (test) feature vectors. The classifier predicts each test example's label as the label of its single closest training example under **squared Euclidean distance**. ### Constraints & Assumptions - Use **squared Euclidean distance** $\lVert x - t_i \rVert^2$ as the proximity metric (no need to take the square root). - The implementation must be **vectorized** — no Python-level loops over individual test or training examples. - **Tie-break rule:** if two or more training examples are equidistant nearest neighbors of a query, return the label of the **earliest** such training example (smallest index in `X_train`). - Assume `X_train` and `X_test` share the same feature dimension `d`, and `y_train` aligns with `X_train` row-for-row. ### Clarifying Questions to Ask - Can I assume the inputs are already NumPy arrays of consistent dtype, or should I coerce/validate shapes and dtypes inside the function? - How large can `n_train`, `n_test`, and `d` get? Does the full `(n_test, n_train)` distance matrix fit in memory, or should I plan to batch the test set? - Is plain (squared) Euclidean distance the intended metric, or do you also want me to discuss cosine / other metrics? - For Part 2, do you want literal Keras/PyTorch layers, or just the explicit weight matrix and bias vector plus the readout rule? - Should I handle degenerate inputs (empty train or test set, duplicate points, NaNs), or can I assume well-formed data? --- ### Part 1 — Vectorized NumPy implementation Write a vectorized function that takes `X_train`, `y_train`, `X_test` and returns a length-`n_test` array of predicted labels, where each prediction is the label of the nearest training example under squared Euclidean distance. Respect the earliest-index tie-break rule above, and use no Python-level loops over examples. ```hint Where to start You need the full $(n_{\text{test}}, n_{\text{train}})$ matrix of pairwise squared distances, then an `argmin` along the training axis. Think about how to get every pairwise distance with array operations instead of nested loops. ``` ```hint Distance trick Materializing the difference for every test-train pair forces a 3-D intermediate. Try expanding $\lVert x - t \rVert^2$ algebraically instead — which of the resulting terms is a genuine pairwise interaction, and which depend on only one side? Once you separate them, ask which can be precomputed per row or per column and reused via broadcasting. ``` ```hint Tie-break Look up exactly how `np.argmin` resolves the situation where several entries along the axis tie for the minimum. Then think about the order in which the training examples appear along that axis, and whether the rule you're asked for falls out of that behavior or needs handling on top. ``` #### What This Part Should Cover - A fully **vectorized** distance computation with **no per-example Python loops**, ideally via the expanded-norm matmul identity that maps onto BLAS. - An explicit argument for **why `argmin` along the training axis satisfies the earliest-index tie-break** (first-occurrence semantics + columns in training order). - A correct **time/memory complexity** characterization and the recognition that the full `(n_test, n_train)` score matrix is the memory bottleneck (and a batching strategy when it doesn't fit). - **Numerical-stability awareness**: the expanded form can cancel catastrophically (tiny negative "distances" for coincident points, flipped near-ties) versus the direct subtract-and-square form — and when each is appropriate. ### Part 2 — As a neural-network computation with fixed weights Show how the same nearest-neighbor decision can be represented as a neural-network-style computation with **fixed** weights and biases derived directly from `X_train` (no training). Derive the explicit weight matrix and bias vector, and state the readout rule that recovers the predicted label. ```hint Where to start A 1NN decision is an $\arg\min$ over training examples. Can you turn that into an $\arg\max$ of an expression that is *affine in the query* $x$? An affine map followed by a winner-take-all is exactly one dense layer. ``` ```hint Drop a constant Write out the squared-distance score for a single training point and ask which terms actually depend on *which* training point you're comparing against. Any term that is the same for every candidate can't break the ranking — so it can be discarded before you try to read off a linear form. What's left, and what sign convention turns the "nearest" comparison into a "largest score" one? ``` ```hint Read off W and b Once the score is purely linear in the query $x$ plus a per-training-point constant, line it up against the shape of a dense layer's output, $xW^\top + b$. The query-dependent part tells you what each row of $W$ must be (in terms of a training vector); the constant part tells you the per-unit bias. The predicted label then comes from a fixed lookup keyed on the winning unit's index. ``` #### What This Part Should Cover - A correct **algebraic reduction** from $\arg\min$ of $\lVert x - t_i\rVert^2$ to $\arg\max$ of an expression affine in $x$, with the dropped $\lVert x\rVert^2$ term justified (constant across candidates). - The **explicit weight matrix and bias vector** ($W$ as a function of $X_{\text{train}}$, $b$ as a function of the per-point norms), with correct shapes. - A clear **readout rule** mapping the winning unit's index back to a class label (e.g. a fixed one-hot label matrix as a second layer), framing the whole thing as a feed-forward template-matching net whose weights *are* the memorized data. - Acknowledgment that this layer is **untrained / hand-set**, plus the parity (or floating-point caveat) between this `argmax` formulation and the Part 1 `argmin`. ### What a Strong Answer Covers These dimensions span both parts: - **Equivalence and parity:** the candidate demonstrates that the Part 1 distance-`argmin` and the Part 2 logit-`argmax` produce the same predictions (and can articulate the small floating-point caveat — Part 2 skips the $\lVert x\rVert^2$ subtraction, so rounding differs slightly). - **Edge-case handling that applies throughout:** shape/dimension mismatches, empty train/test inputs, duplicate/coincident points, and how NaNs would propagate. - **Scaling and metric sensitivity:** feature-scale sensitivity of Euclidean 1NN, and how the matmul expansion specializes for cosine (L2-normalize, then a pure dot-product `argmax`). ### Follow-up Questions - How would you generalize this to $k$NN with a majority vote? What changes in both the NumPy code and the neural-network view, and what new tie-break ambiguities appear? - The neural-network readout uses a hard $\arg\max$. How would you make it differentiable, and what does the temperature parameter control as it goes to zero? - The matmul-based distance expansion is numerically unstable in some cases. Where does the cancellation come from, and how would you detect or mitigate it? - `n_train` is now in the tens of millions. Brute force no longer fits — what indexing or approximation would you reach for, and how does the right choice depend on `d`?

Quick Answer: This question evaluates implementing a 1-nearest-neighbor classifier with NumPy, testing skills in vectorized numerical computation, distance metrics, tie-breaking label handling, and translating algorithms into linear-algebra operations.

Implement a 1-nearest-neighbor (1NN) classifier from scratch using NumPy, then show that the same decision can be expressed as a neural-network-style computation with fixed weights and biases.

You are given:

X_train : a NumPy array of shape (n_train, d) containing training feature vectors.
y_train : a NumPy array of shape (n_train,) containing the corresponding labels.
X_test : a NumPy array of shape (n_test, d) containing query (test) feature vectors.

The classifier predicts each test example's label as the label of its single closest training example under squared Euclidean distance.

Constraints & Assumptions

Use squared Euclidean distance $\lVert x - t_i \rVert^2$ as the proximity metric (no need to take the square root).
The implementation must be vectorized — no Python-level loops over individual test or training examples.
Tie-break rule: if two or more training examples are equidistant nearest neighbors of a query, return the label of the earliest such training example (smallest index in X_train ).
Assume X_train and X_test share the same feature dimension d , and y_train aligns with X_train row-for-row.

Clarifying Questions to Ask

Can I assume the inputs are already NumPy arrays of consistent dtype, or should I coerce/validate shapes and dtypes inside the function?
How large can n_train , n_test , and d get? Does the full (n_test, n_train) distance matrix fit in memory, or should I plan to batch the test set?
Is plain (squared) Euclidean distance the intended metric, or do you also want me to discuss cosine / other metrics?
For Part 2, do you want literal Keras/PyTorch layers, or just the explicit weight matrix and bias vector plus the readout rule?
Should I handle degenerate inputs (empty train or test set, duplicate points, NaNs), or can I assume well-formed data?

Part 1 — Vectorized NumPy implementation

Write a vectorized function that takes X_train, y_train, X_test and returns a length-n_test array of predicted labels, where each prediction is the label of the nearest training example under squared Euclidean distance. Respect the earliest-index tie-break rule above, and use no Python-level loops over examples.

What This Part Should Cover

A fully vectorized distance computation with no per-example Python loops , ideally via the expanded-norm matmul identity that maps onto BLAS.
An explicit argument for why argmin along the training axis satisfies the earliest-index tie-break (first-occurrence semantics + columns in training order).
A correct time/memory complexity characterization and the recognition that the full (n_test, n_train) score matrix is the memory bottleneck (and a batching strategy when it doesn't fit).
Numerical-stability awareness : the expanded form can cancel catastrophically (tiny negative "distances" for coincident points, flipped near-ties) versus the direct subtract-and-square form — and when each is appropriate.

Part 2 — As a neural-network computation with fixed weights

Show how the same nearest-neighbor decision can be represented as a neural-network-style computation with fixed weights and biases derived directly from X_train (no training). Derive the explicit weight matrix and bias vector, and state the readout rule that recovers the predicted label.

What This Part Should Cover

A correct algebraic reduction from $\arg\min$ of $\lVert x - t_i\rVert^2$ to $\arg\max$ of an expression affine in $x$ , with the dropped $\lVert x\rVert^2$ term justified (constant across candidates).
The explicit weight matrix and bias vector ( $W$ as a function of $X_{\text{train}}$ , $b$ as a function of the per-point norms), with correct shapes.
A clear readout rule mapping the winning unit's index back to a class label (e.g. a fixed one-hot label matrix as a second layer), framing the whole thing as a feed-forward template-matching net whose weights are the memorized data.
Acknowledgment that this layer is untrained / hand-set , plus the parity (or floating-point caveat) between this argmax formulation and the Part 1 argmin .

What a Strong Answer Covers

These dimensions span both parts:

Equivalence and parity: the candidate demonstrates that the Part 1 distance- argmin and the Part 2 logit- argmax produce the same predictions (and can articulate the small floating-point caveat — Part 2 skips the $\lVert x\rVert^2$ subtraction, so rounding differs slightly).
Edge-case handling that applies throughout: shape/dimension mismatches, empty train/test inputs, duplicate/coincident points, and how NaNs would propagate.
Scaling and metric sensitivity: feature-scale sensitivity of Euclidean 1NN, and how the matmul expansion specializes for cosine (L2-normalize, then a pure dot-product argmax ).

Follow-up Questions

How would you generalize this to $k$ NN with a majority vote? What changes in both the NumPy code and the neural-network view, and what new tie-break ambiguities appear?
The neural-network readout uses a hard $\arg\max$ . How would you make it differentiable, and what does the temperature parameter control as it goes to zero?
The matmul-based distance expansion is numerically unstable in some cases. Where does the cancellation come from, and how would you detect or mitigate it?
n_train is now in the tens of millions. Brute force no longer fits — what indexing or approximation would you reach for, and how does the right choice depend on d ?

Implement 1NN with NumPy

Company: OpenAI

Role: Machine Learning Engineer

Category: Machine Learning

Difficulty: medium

Interview Round: Technical Screen

Implement 1NN with NumPy

Quick Overview

Implement 1NN with NumPy

Constraints & Assumptions

Clarifying Questions to Ask

Part 1 — Vectorized NumPy implementation

What This Part Should Cover

Part 2 — As a neural-network computation with fixed weights

What This Part Should Cover

What a Strong Answer Covers

Follow-up Questions

Write your answer

Implement 1NN with NumPy

Quick Overview

Implement 1NN with NumPy

Constraints & Assumptions

Clarifying Questions to Ask

Part 1 — Vectorized NumPy implementation

What This Part Should Cover

Part 2 — As a neural-network computation with fixed weights

What This Part Should Cover

What a Strong Answer Covers

Follow-up Questions

Write your answer