Implement K-means clustering from scratch
Company: Microsoft
Role: Machine Learning Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Technical Screen
## Task: Implement K-means clustering (from scratch)
Write a function to perform **K-means clustering** on a set of points.
### Input
- A dataset `X` with shape `(n_samples, n_features)` (e.g., a NumPy array or PyTorch tensor).
- An integer `k` = number of clusters.
- Optional parameters:
- `max_iters` (e.g., 100)
- `tol` for convergence (e.g., 1e-4)
- Initialization method (e.g., random points from `X`)
### Output
Return:
- `centroids`: shape `(k, n_features)`
- `labels`: length `n_samples`, where each label is in `[0, k-1]`
### Requirements / Discussion Points
- Describe and implement the two core steps iteratively:
1. **Assignment step**: assign each point to the nearest centroid (e.g., Euclidean distance).
2. **Update step**: recompute each centroid as the mean of assigned points.
- Define a **stopping condition** (e.g., centroid shift < `tol` or `max_iters` reached).
- Handle edge cases (at least verbally), e.g.:
- A cluster gets **no points** in an iteration (empty cluster).
- `k > n_samples`.
- You may use **Python with NumPy and/or PyTorch**, but do not call a library K-means implementation.
*(In the interview, correctness and reasoning matter more than making it fully runnable.)*
Quick Answer: This question evaluates competency in clustering algorithms, numerical computation, and practical algorithm implementation, focusing on core K-means concepts like distance-based grouping and centroid estimation. It is commonly asked because it exposes practical implementation ability, convergence reasoning, numerical edge-case handling (e.g.