Implement K-means clustering from scratch
Company: Microsoft
Role: Machine Learning Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Technical Screen
Quick Answer: This question evaluates competency in clustering algorithms, numerical computation, and practical algorithm implementation, focusing on core K-means concepts like distance-based grouping and centroid estimation. It is commonly asked because it exposes practical implementation ability, convergence reasoning, numerical edge-case handling (e.g.
Constraints
- For valid clustering input, 1 <= k <= n_samples
- 1 <= n_features <= 20
- 0 <= max_iters <= 1000
- Points contain integers or floats
- If X is empty, k <= 0, or k > n_samples, return ([], [])
Examples
Input: ([[1, 1], [1, 2], [9, 9], [8, 8]], 2)
Expected Output: ([[1.0, 1.5], [8.5, 8.5]], [0, 0, 1, 1])
Explanation: The points split into two natural groups. After convergence, their means are [1.0, 1.5] and [8.5, 8.5].
Input: ([[2, 2], [4, 4], [6, 6]], 1)
Expected Output: ([[4.0, 4.0]], [0, 0, 0])
Explanation: With only one cluster, every point belongs to it and the centroid is the overall mean.
Input: ([[0, 0], [1, 1]], 3)
Expected Output: ([], [])
Explanation: The input is invalid because k is greater than the number of samples.
Input: ([[0, 0], [0, 0], [10, 10]], 2)
Expected Output: ([[0.0, 0.0], [10.0, 10.0]], [0, 0, 1])
Explanation: The initial centroids are identical, so one cluster is empty in the first update. Keeping the empty centroid unchanged still allows the algorithm to recover.
Input: ([[-2, -2], [-1, -1], [4, 4], [5, 5]], 2, 50, 1e-6)
Expected Output: ([[-1.5, -1.5], [4.5, 4.5]], [0, 0, 1, 1])
Explanation: The negative points form one cluster and the positive points form the other, with centroids at their respective means.
Hints
- Keep the algorithm in a loop with two phases: assign each point to its closest centroid, then recompute each centroid as the mean of its assigned points.
- To make the answer deterministic, think carefully about tie-breaking, empty clusters, and the final ordering of centroids.