How do I practice coding and algorithm questions?

Use PracHub's coding console to write, test, and debug your solutions in Python or JavaScript. View hints, test against sample inputs, and compare with official solutions.

What difficulty level is this coding question?

This is a medium difficulty Coding & Algorithms question, commonly asked during Technical Screen rounds at Microsoft.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at Microsoft during technical interviews.

Implement K-means clustering from scratch | Microsoft Coding Question

Quick Overview

This question evaluates competency in clustering algorithms, numerical computation, and practical algorithm implementation, focusing on core K-means concepts like distance-based grouping and centroid estimation. It is commonly asked because it exposes practical implementation ability, convergence reasoning, numerical edge-case handling (e.g.

Implement K-means clustering from scratch

Company: Microsoft

Role: Machine Learning Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Technical Screen

## Task: Implement K-means clustering (from scratch) Write a function to perform **K-means clustering** on a set of points. ### Input - A dataset `X` with shape `(n_samples, n_features)` (e.g., a NumPy array or PyTorch tensor). - An integer `k` = number of clusters. - Optional parameters: - `max_iters` (e.g., 100) - `tol` for convergence (e.g., 1e-4) - Initialization method (e.g., random points from `X`) ### Output Return: - `centroids`: shape `(k, n_features)` - `labels`: length `n_samples`, where each label is in `[0, k-1]` ### Requirements / Discussion Points - Describe and implement the two core steps iteratively: 1. **Assignment step**: assign each point to the nearest centroid (e.g., Euclidean distance). 2. **Update step**: recompute each centroid as the mean of assigned points. - Define a **stopping condition** (e.g., centroid shift < `tol` or `max_iters` reached). - Handle edge cases (at least verbally), e.g.: - A cluster gets **no points** in an iteration (empty cluster). - `k > n_samples`. - You may use **Python with NumPy and/or PyTorch**, but do not call a library K-means implementation. *(In the interview, correctness and reasoning matter more than making it fully runnable.)*

Quick Answer: This question evaluates competency in clustering algorithms, numerical computation, and practical algorithm implementation, focusing on core K-means concepts like distance-based grouping and centroid estimation. It is commonly asked because it exposes practical implementation ability, convergence reasoning, numerical edge-case handling (e.g.

Write a function that performs K-means clustering on a dataset of points without using any library K-means implementation. For deterministic grading, use these rules: 1. Initialize the centroids as the first k points in X. 2. In the assignment step, assign each point to the nearest centroid using Euclidean distance. If two centroids are tied, choose the smaller centroid index. 3. In the update step, recompute each centroid as the mean of all points assigned to it. 4. If a cluster gets no points in an iteration, keep its centroid unchanged. 5. Stop when the maximum centroid shift is less than or equal to tol, or when max_iters iterations have been completed. 6. Before returning, sort the final centroids lexicographically and remap labels to match that sorted order. If the input is invalid (empty dataset, k <= 0, or k > number of samples), return ([], []).

Constraints

For valid clustering input, 1 <= k <= n_samples
1 <= n_features <= 20
0 <= max_iters <= 1000
Points contain integers or floats
If X is empty, k <= 0, or k > n_samples, return ([], [])

Examples

Input: ([[1, 1], [1, 2], [9, 9], [8, 8]], 2)

Expected Output: ([[1.0, 1.5], [8.5, 8.5]], [0, 0, 1, 1])

Explanation: The points split into two natural groups. After convergence, their means are [1.0, 1.5] and [8.5, 8.5].

Input: ([[2, 2], [4, 4], [6, 6]], 1)

Expected Output: ([[4.0, 4.0]], [0, 0, 0])

Explanation: With only one cluster, every point belongs to it and the centroid is the overall mean.

Input: ([[0, 0], [1, 1]], 3)

Expected Output: ([], [])

Explanation: The input is invalid because k is greater than the number of samples.

Input: ([[0, 0], [0, 0], [10, 10]], 2)

Expected Output: ([[0.0, 0.0], [10.0, 10.0]], [0, 0, 1])

Explanation: The initial centroids are identical, so one cluster is empty in the first update. Keeping the empty centroid unchanged still allows the algorithm to recover.

Input: ([[-2, -2], [-1, -1], [4, 4], [5, 5]], 2, 50, 1e-6)

Expected Output: ([[-1.5, -1.5], [4.5, 4.5]], [0, 0, 1, 1])

Explanation: The negative points form one cluster and the positive points form the other, with centroids at their respective means.

Hints

Keep the algorithm in a loop with two phases: assign each point to its closest centroid, then recompute each centroid as the mean of its assigned points.
To make the answer deterministic, think carefully about tie-breaking, empty clusters, and the final ordering of centroids.

Quick Overview