Implement KNN from Scratch
Company: J.P. Morgan
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Technical Screen
Quick Answer: This question evaluates a candidate's understanding and practical implementation skills for instance-based supervised learning, specifically the k-nearest neighbors algorithm, covering distance metrics, neighbor selection, majority voting and tie-breaking, feature scaling effects, time complexity, and model selection considerations like k and weighting. It is commonly asked to assess both conceptual grasp of non-parametric classification and practical coding ability to implement and analyze algorithms, and it falls under coding & algorithms and machine learning domains with a mix of conceptual understanding and practical application.
Constraints
- 1 <= len(X_train) == len(y_train) <= 1000
- 0 <= len(X_test) <= 1000
- 1 <= len(X_train[0]) <= 50
- 1 <= k <= len(X_train)
- Feature values are integers or floats in the range [-10^4, 10^4]
- All class labels are integers
Examples
Input: ([[1, 1], [2, 2], [3, 3], [6, 6]], [0, 0, 1, 1], [[2.1, 2.1], [5, 5]], 3)
Expected Output: [0, 1]
Explanation: For [2.1, 2.1], the 3 nearest labels are [0, 1, 0], so the prediction is 0. For [5, 5], the 3 nearest labels are [1, 1, 0], so the prediction is 1.
Input: ([[0, 0], [0, 2], [2, 0], [2, 2]], [0, 1, 1, 0], [[1, 1]], 4)
Expected Output: [0]
Explanation: All 4 training points are equally distant from [1, 1]. The vote counts are 0 -> 2 and 1 -> 2, so the deterministic tie-break returns the smaller label, 0.
Input: ([[1, 2], [1, 3], [4, 5], [5, 5], [5, 6]], [0, 0, 1, 1, 2], [[5, 5.2], [1, 2.4]], 3)
Expected Output: [1, 0]
Explanation: For [5, 5.2], the nearest labels are [1, 2, 1], so 1 wins. For [1, 2.4], the nearest labels are [0, 0, 1], so 0 wins.
Input: ([[0, 0], [1, 1]], [0, 1], [], 1)
Expected Output: []
Explanation: There are no test samples, so the result is an empty list.
Input: ([[3, 4]], [7], [[0, 0], [10, 10]], 1)
Expected Output: [7, 7]
Explanation: With only one training sample and k = 1, every test sample is classified as label 7.
Input: ([[-2, -1], [-1, -3], [2, 2], [3, 3]], [0, 0, 1, 1], [[-1.5, -2], [2.5, 2.5]], 3)
Expected Output: [0, 1]
Explanation: This checks negative coordinates. The first test point is closest to two class-0 points, while the second is closest to two class-1 points.
Hints
- For each test point, build a list of (distance, label) pairs by comparing it to every training point.
- After sorting by distance, only the first k labels matter. Use a frequency map to count votes, then apply the tie rule consistently.