Implement robust k-means from scratch

Q: Implement robust k-means from scratch

This question evaluates implementation and engineering competency in K-Means clustering. It covers numerical stability, edge-case handling (empty clusters, NaNs/Infs), initialization strategies, convergence criteria, complexity analysis, evaluation methods for selecting k, and unit testing, all within the Machine Learning domain of unsupervised clustering and algorithmic implementation. It is commonly asked to assess both practical application (writing production-ready, vectorized, testable code) and conceptual understanding (trade-offs in initialization, convergence, computational complexity, and evaluation metrics). A strong answer demonstrates the ability to deliver robust, scalable clustering solutions.

Question

Implement K-Means Clustering From Scratch (Production-Ready)

Context

You are asked to implement K-Means clustering from scratch for a machine learning interview. Write code that is robust enough for production use (numerical stability, edge-case handling) and provide analysis, evaluation strategy, and tests.

Requirements

Function signature and return:
- kmeans(X, k, init="k-means++", max_iter=300, tol=1e-4, random_state=0, n_init=10) → (centers, labels, inertia)
Inputs:
- X: float matrix of shape (n × d). Use no ML libraries; vectorize where possible with NumPy.
Initialization:
- Implement k-means++ seeding.
- Run n_init independent restarts; return the run with minimal inertia.
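
One way the seeding step could look is sketched below. The helper name `kmeans_pp_init` and the choice to pass a `numpy.random.Generator` are assumptions made for illustration; the question itself only fixes the outer `kmeans` signature. Each new center is drawn with probability proportional to its squared distance from the nearest center already chosen, with a uniform fallback when every remaining distance is zero (all points duplicated).

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """Pick k initial centers from the rows of X using k-means++ seeding.

    Hypothetical helper: rng is a numpy.random.Generator.
    """
    n = X.shape[0]
    centers = np.empty((k, X.shape[1]), dtype=float)
    # First center: chosen uniformly at random.
    centers[0] = X[rng.integers(n)]
    # Squared distance from each point to its nearest chosen center so far.
    d2 = np.sum((X - centers[0]) ** 2, axis=1)
    for j in range(1, k):
        total = d2.sum()
        if total <= 0.0:
            # Every point coincides with a chosen center; sample uniformly.
            idx = rng.integers(n)
        else:
            # Sample proportionally to squared distance (the D^2 weighting).
            idx = rng.choice(n, p=d2 / total)
        centers[j] = X[idx]
        d2 = np.minimum(d2, np.sum((X - centers[j]) ** 2, axis=1))
    return centers
```

For the `n_init` restarts, a caller would typically derive one generator per restart from `random_state`, run the full algorithm from each seeding, and keep the run with the smallest inertia.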
Convergence:
- Stop when the relative change in inertia is < tol for two consecutive iterations.
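
The stopping rule can be kept separate from the main loop, as in this minimal sketch. The factory name `make_convergence_check` and the `patience` parameter are hypothetical; the small floor on the denominator is an added assumption so that an inertia of exactly zero does not cause a division by zero.

```python
def make_convergence_check(tol, patience=2):
    """Return a callable that reports True once the relative inertia change
    has stayed below tol for `patience` consecutive iterations."""
    state = {"prev": None, "streak": 0}

    def update(inertia):
        prev = state["prev"]
        state["prev"] = inertia
        if prev is None:
            # First iteration: nothing to compare against yet.
            return False
        # Floor the denominator so a perfect fit (inertia 0) is handled.
        rel_change = abs(prev - inertia) / max(abs(prev), 1e-12)
        # Count consecutive small changes; any large change resets the count.
        state["streak"] = state["streak"] + 1 if rel_change < tol else 0
        return state["streak"] >= patience

    return update
```

Inside the iteration loop, the caller would evaluate `if converged(inertia): break` after each update step, where `converged = make_convergence_check(tol)`.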
Empty clusters:
- Handle empty clusters by splitting the cluster with the highest within-cluster variance: take the point in that cluster furthest from its centroid and use it to seed the empty cluster. Describe and implement this policy.
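
A possible reading of this policy is sketched below. The function name `fix_empty_clusters` is hypothetical, and two details are assumptions rather than part of the question: variance is taken as the mean squared distance to the centroid, and only clusters with at least two points may donate. `labels` is expected to be an integer NumPy array and is modified in place.

```python
import numpy as np

def fix_empty_clusters(X, centers, labels):
    """Re-seed each empty cluster from the highest-variance cluster (in place)."""
    k = centers.shape[0]
    for j in range(k):
        if np.any(labels == j):
            continue  # cluster j has members, nothing to repair
        # Squared distance of every point to its currently assigned center.
        d2 = np.sum((X - centers[labels]) ** 2, axis=1)
        counts = np.bincount(labels, minlength=k)
        sse = np.bincount(labels, weights=d2, minlength=k)
        # Within-cluster variance; clusters with fewer than 2 points cannot donate.
        var = np.where(counts >= 2, sse / np.maximum(counts, 1), -np.inf)
        donor = int(np.argmax(var))
        if not np.isfinite(var[donor]):
            break  # no cluster can be split (for example when n < k)
        members = np.flatnonzero(labels == donor)
        far = members[np.argmax(d2[members])]
        # Move the furthest point into the empty cluster as its new center.
        labels[far] = j
        centers[j] = X[far]
        # The donor lost a point, so recompute its centroid.
        centers[donor] = X[labels == donor].mean(axis=0)
    return centers, labels
```

If the donor consists only of duplicate points, the new center coincides with the donor's center; a fuller implementation would need to decide whether that case should raise, warn, or be accepted.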
Numerical stability:
- Discuss and implement safe handling for NaNs/Infs and feature scaling. Support optional standardization.
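
A minimal sketch of input validation with optional standardization follows. The helper name `prepare_features` and its `standardize` flag are hypothetical, and rejecting non-finite values outright (instead of imputing them) is one policy among several that could be defended.

```python
import numpy as np

def prepare_features(X, standardize=False):
    """Validate X and optionally z-score each column.

    Returns (X_prepared, mean, scale); callers can map centers back to the
    original units with centers * scale + mean.
    """
    X = np.asarray(X, dtype=np.float64)
    if X.ndim != 2:
        raise ValueError("X must be a 2-D array of shape (n, d)")
    if not np.all(np.isfinite(X)):
        raise ValueError("X contains NaN or Inf; impute or drop those rows first")
    d = X.shape[1]
    if not standardize:
        return X, np.zeros(d), np.ones(d)
    mean = X.mean(axis=0)
    scale = X.std(axis=0)
    # Constant columns have zero spread; leave them unscaled to avoid 0/0.
    scale[scale == 0.0] = 1.0
    return (X - mean) / scale, mean, scale
```

Standardizing matters because squared Euclidean distance is dominated by whichever feature has the largest numeric range, so unscaled features in different units can distort the clusters.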
Complexity:
- Analyze time and space complexity in terms of n, d, k. Propose a mini-batch K-Means variant and explain when it helps.
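
For reference, a fully vectorized Lloyd iteration costs O(nkd) time and materializes an n × k distance matrix, so a full fit takes O(n_init · T · nkd) time for T iterations and O(nd + kd + nk) space. A mini-batch variant replaces n with a batch size b per step, which helps when n is too large for repeated full passes. The sketch below is one simple formulation using per-center counts as a decaying learning rate; the name `minibatch_kmeans`, its parameters, and the plain random-row initialization (instead of k-means++) are assumptions chosen to keep the example short.

```python
import numpy as np

def minibatch_kmeans(X, k, batch_size=256, n_steps=100, random_state=0):
    """Approximate k-means centers using small random batches."""
    rng = np.random.default_rng(random_state)
    n = X.shape[0]
    # Simplification: seed with k distinct random rows rather than k-means++.
    centers = X[rng.choice(n, size=k, replace=False)].astype(float)
    counts = np.zeros(k)
    for _ in range(n_steps):
        batch = X[rng.integers(0, n, size=min(batch_size, n))]
        # Squared distances of shape (b, k) via broadcasting: O(b * k * d).
        d2 = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        for j in np.unique(nearest):
            pts = batch[nearest == j]
            counts[j] += len(pts)
            # Learning rate shrinks as the center accumulates more points.
            eta = len(pts) / counts[j]
            centers[j] = (1.0 - eta) * centers[j] + eta * pts.mean(axis=0)
    return centers
```

The trade-off is a somewhat higher final inertia than full-batch K-Means in exchange for far less work per step, which tends to pay off for large n or streaming data.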
Evaluation:
- Describe how to pick k (e.g., silhouette score, gap statistic) and how to test stability across random seeds.
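
As an illustration of the silhouette option, here is a small from-scratch version. The function name `silhouette_score` is chosen for this sketch, which builds the full n × n distance matrix and is therefore only suitable for small or subsampled data. For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the score is (b − a) / max(a, b).

```python
import numpy as np

def silhouette_score(X, labels):
    """Mean silhouette coefficient; O(n^2) memory, so subsample large X first."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    uniq = np.unique(labels)
    if len(uniq) < 2:
        raise ValueError("silhouette needs at least 2 clusters")
    n = X.shape[0]
    # Pairwise Euclidean distances via broadcasting, shape (n, n).
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    # sum_to[i, c] = summed distance from point i to the members of cluster c.
    sum_to = np.empty((n, len(uniq)))
    sizes = np.empty(len(uniq))
    for c, lab in enumerate(uniq):
        mask = labels == lab
        sizes[c] = mask.sum()
        sum_to[:, c] = dist[:, mask].sum(axis=1)
    own = np.searchsorted(uniq, labels)  # column index of each point's cluster
    rows = np.arange(n)
    own_size = sizes[own]
    # a: mean distance to the other members of the same cluster (self excluded).
    a = sum_to[rows, own] / np.maximum(own_size - 1, 1)
    mean_to = sum_to / sizes
    mean_to[rows, own] = np.inf  # exclude the point's own cluster from b
    b = mean_to.min(axis=1)
    s = (b - a) / np.maximum(np.maximum(a, b), 1e-12)
    s[own_size == 1] = 0.0  # convention: singleton clusters score 0
    return float(s.mean())
```

To pick k, one would compute this score for each candidate k and prefer the value with the highest mean. For stability across seeds, the same fit can be repeated with several `random_state` values and the spread of inertia or the agreement between label assignments compared.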
Tests:
- Provide unit tests covering: tiny degenerate cases (n<k, duplicate points), high-dimensional sparse-like data, and convergence behavior.
