PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates competency in clustering algorithms, numerical computation, and practical algorithm implementation, focusing on core K-means concepts like distance-based grouping and centroid estimation. It is commonly asked because it exposes practical implementation ability, convergence reasoning, numerical edge-case handling (e.g.

  • medium
  • Microsoft
  • Coding & Algorithms
  • Machine Learning Engineer

Implement K-means clustering from scratch

Company: Microsoft

Role: Machine Learning Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Technical Screen

## Task: Implement K-means clustering (from scratch) Write a function to perform **K-means clustering** on a set of points. ### Input - A dataset `X` with shape `(n_samples, n_features)` (e.g., a NumPy array or PyTorch tensor). - An integer `k` = number of clusters. - Optional parameters: - `max_iters` (e.g., 100) - `tol` for convergence (e.g., 1e-4) - Initialization method (e.g., random points from `X`) ### Output Return: - `centroids`: shape `(k, n_features)` - `labels`: length `n_samples`, where each label is in `[0, k-1]` ### Requirements / Discussion Points - Describe and implement the two core steps iteratively: 1. **Assignment step**: assign each point to the nearest centroid (e.g., Euclidean distance). 2. **Update step**: recompute each centroid as the mean of assigned points. - Define a **stopping condition** (e.g., centroid shift < `tol` or `max_iters` reached). - Handle edge cases (at least verbally), e.g.: - A cluster gets **no points** in an iteration (empty cluster). - `k > n_samples`. - You may use **Python with NumPy and/or PyTorch**, but do not call a library K-means implementation. *(In the interview, correctness and reasoning matter more than making it fully runnable.)*

Quick Answer: This question evaluates competency in clustering algorithms, numerical computation, and practical algorithm implementation, focusing on core K-means concepts like distance-based grouping and centroid estimation. It is commonly asked because it exposes practical implementation ability, convergence reasoning, numerical edge-case handling (e.g.

Write a function that performs K-means clustering on a dataset of points without using any library K-means implementation. For deterministic grading, use these rules: 1. Initialize the centroids as the first k points in X. 2. In the assignment step, assign each point to the nearest centroid using Euclidean distance. If two centroids are tied, choose the smaller centroid index. 3. In the update step, recompute each centroid as the mean of all points assigned to it. 4. If a cluster gets no points in an iteration, keep its centroid unchanged. 5. Stop when the maximum centroid shift is less than or equal to tol, or when max_iters iterations have been completed. 6. Before returning, sort the final centroids lexicographically and remap labels to match that sorted order. If the input is invalid (empty dataset, k <= 0, or k > number of samples), return ([], []).

Constraints

  • For valid clustering input, 1 <= k <= n_samples
  • 1 <= n_features <= 20
  • 0 <= max_iters <= 1000
  • Points contain integers or floats
  • If X is empty, k <= 0, or k > n_samples, return ([], [])

Examples

Input: ([[1, 1], [1, 2], [9, 9], [8, 8]], 2)

Expected Output: ([[1.0, 1.5], [8.5, 8.5]], [0, 0, 1, 1])

Explanation: The points split into two natural groups. After convergence, their means are [1.0, 1.5] and [8.5, 8.5].

Input: ([[2, 2], [4, 4], [6, 6]], 1)

Expected Output: ([[4.0, 4.0]], [0, 0, 0])

Explanation: With only one cluster, every point belongs to it and the centroid is the overall mean.

Input: ([[0, 0], [1, 1]], 3)

Expected Output: ([], [])

Explanation: The input is invalid because k is greater than the number of samples.

Input: ([[0, 0], [0, 0], [10, 10]], 2)

Expected Output: ([[0.0, 0.0], [10.0, 10.0]], [0, 0, 1])

Explanation: The initial centroids are identical, so one cluster is empty in the first update. Keeping the empty centroid unchanged still allows the algorithm to recover.

Input: ([[-2, -2], [-1, -1], [4, 4], [5, 5]], 2, 50, 1e-6)

Expected Output: ([[-1.5, -1.5], [4.5, 4.5]], [0, 0, 1, 1])

Explanation: The negative points form one cluster and the positive points form the other, with centroids at their respective means.

Hints

  1. Keep the algorithm in a loop with two phases: assign each point to its closest centroid, then recompute each centroid as the mean of its assigned points.
  2. To make the answer deterministic, think carefully about tie-breaking, empty clusters, and the final ordering of centroids.
Last updated: Jun 1, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Return Top K Open Businesses - Microsoft (hard)
  • Implement Memory Allocation and In-Memory Records - Microsoft (medium)
  • Implement K-Means and Detect Divisible Subarrays - Microsoft (medium)
  • Sort Three Categories In Place - Microsoft (medium)
  • Retain Top K Elements - Microsoft (medium)