Implement K-means and solve interval/frequency tasks
Company: Amazon
Role: Machine Learning Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Onsite
## Task 1 — Describe/implement K-means clustering
Given:
- A data matrix `X` with shape `(n_samples, d)`.
- An integer `k` (number of clusters).
Explain (or write high-level pseudocode for) the K-means algorithm such that:
- The steps are unambiguous.
- Matrix/vector dimensions are consistent.
- You cover initialization, assignment, centroid update, and stopping criteria.
## Task 2 — Merge overlapping intervals
Given a list of closed intervals `intervals`, where each interval is `[start, end]` and `start <= end`, merge all overlapping intervals and return the merged list.
**Input:** `intervals: List[List[int]]`
**Output:** merged intervals in any order (or sorted by start), with no overlaps.
## Task 3 — Return the top-k most frequent elements
Given an integer array `nums` and an integer `k`, return the `k` values that occur most frequently.
**Input:** `nums: List[int]`, `k: int`
**Output:** `List[int]` containing `k` elements.
### Constraints (assume typical interview constraints)
- `1 <= n <= 2e5`
- Values fit in 32-bit signed integers
- `1 <= k <= number of distinct values in nums`
Quick Answer: This multi-part question evaluates understanding of K-means clustering, interval-merging algorithms, and frequency-based top-k retrieval, assessing competencies in unsupervised learning concepts, algorithmic problem solving, data structures, and complexity analysis.
Part 1: Deterministic K-means Clustering
Given a data matrix X with n points and d features per point, implement a deterministic version of K-means clustering. Use the first k points of X as the initial centroids. In each iteration, assign every point to the nearest centroid using squared Euclidean distance; if two centroids are tied, choose the smaller centroid index. Then update each centroid to the coordinate-wise mean of the points assigned to it. If a cluster receives no points, keep its centroid unchanged. Stop when the labels no longer change or when max_iters iterations have been executed. Return both the final centroids and the label for each point. Do not round the centroid values.
Constraints
- 1 <= n <= 10^3
- 1 <= d <= 20
- 1 <= k <= n
- 1 <= max_iters <= 100
- All rows of X have the same length d
- Each coordinate fits in standard Python int/float ranges
Examples
Input: ([[1, 1], [1, 2], [4, 4], [4, 5]], 2)
Expected Output: {'centroids': [[1.0, 1.5], [4.0, 4.5]], 'labels': [0, 0, 1, 1]}