Implement K-Means and Explain Convergence
Company: LinkedIn
Role: Machine Learning Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Technical Screen
Implement the K-means clustering algorithm for points in Euclidean space.
Your implementation should:
- take a dataset of points and a target number of clusters `k`,
- initialize `k` centroids,
- repeatedly assign each point to its nearest centroid,
- recompute each centroid as the mean of assigned points,
- stop using a reasonable convergence criterion.
Then explain stopping criteria, empty-cluster handling, and time complexity.
### Constraints & Assumptions
- Points are numeric vectors of equal dimension.
- `k` is positive and no larger than the number of points unless you explicitly handle that case.
- Euclidean distance is used.
- A deterministic initialization such as first `k` points is acceptable for a coding interview, but mention random or k-means++ initialization as alternatives.
- Return cluster assignments and centroids.
### Clarifying Questions to Ask
- Should initialization be deterministic, random, or k-means++?
- Should the implementation return labels, centroids, or both?
- What should happen if a cluster becomes empty?
- What tolerance and maximum iterations should be used?
### What a Strong Answer Covers
- Clear assignment and update steps.
- A convergence check such as unchanged assignments, centroid movement below tolerance, objective improvement below tolerance, or maximum iterations.
- Empty cluster handling.
- Objective function intuition.
- Time complexity per iteration in terms of number of points, clusters, and dimensions.
### Follow-up Questions
- Why does K-means converge?
- Does K-means always find the global optimum?
- How would k-means++ improve initialization?
- How would you scale K-means to very large datasets?
Quick Answer: Implement K-means clustering in Python and explain convergence. Covers assignment and update steps, stopping criteria, empty clusters, objective function, k-means++ discussion, and time complexity.