PracHub
QuestionsCoachesLearningGuidesInterview Prep
|Home/Machine Learning/Microsoft

Compare CNN/RNN/LSTM and implement K-means

Last updated: Jun 24, 2026

Quick Overview

This question evaluates understanding of deep learning architectures (CNN, RNN, LSTM) and unsupervised clustering implementation (K-means), covering inductive biases, parameter sharing and receptive fields, LSTM gating and vanishing-gradient behavior, tensor shape reasoning, algorithmic complexity, initialization, and cluster handling.

  • hard
  • Microsoft
  • Machine Learning
  • Data Scientist

Compare CNN/RNN/LSTM and implement K-means

Company: Microsoft

Role: Data Scientist

Category: Machine Learning

Difficulty: hard

Interview Round: Onsite

Part A (concepts): Contrast CNNs and RNNs for (i) 224×224 RGB images and (ii) variable-length text. Explain inductive biases (translation equivariance, locality, temporal order), parameter sharing, receptive-field growth, and ability to model long-range dependencies; when can a 1D CNN replace an RNN/Transformer? For LSTMs, write the gate equations, show how the cell state mitigates vanishing gradients, and compute output shapes for input (batch=32, seq_len=100, feat=64) with hidden size 128 for unidirectional vs bidirectional cases. Part B (coding): Implement K-means from scratch with k-means++ initialization, vectorized assignment/update steps, and convergence on objective decrease. Discuss O(n·k·d) time and memory trade-offs, handling empty clusters, feature scaling and outliers, choosing k (silhouette/elbow/BIC), and prove the objective is non-increasing per iteration. Propose a mini-batch variant and when you’d use it.

Quick Answer: This question evaluates understanding of deep learning architectures (CNN, RNN, LSTM) and unsupervised clustering implementation (K-means), covering inductive biases, parameter sharing and receptive fields, LSTM gating and vanishing-gradient behavior, tensor shape reasoning, algorithmic complexity, initialization, and cluster handling.

Related Interview Questions

  • How do you choose a model? - Microsoft (medium)
  • Explain SHAP in an ML System - Microsoft (medium)
  • Explain normalization, regularization, CTR, imbalance handling - Microsoft (medium)
  • Clean OCR data and build an LLM dataset - Microsoft (medium)
  • Explain SHAP and build an ML project - Microsoft (easy)
|Home/Machine Learning/Microsoft

Compare CNN/RNN/LSTM and implement K-means

Microsoft logo
Microsoft
Oct 13, 2025, 9:49 PM
hardData ScientistOnsiteMachine Learning
9
0

Deep Learning Concepts and K-means Implementation (Onsite ML Interview)

This is a two-part onsite round for a Data Scientist role: a conceptual deep-learning comparison (Part A) followed by a from-scratch coding exercise (Part B). Part A probes whether you understand why different architectures fit different modalities — not just definitions — and asks you to be precise about LSTM math and tensor shapes. Part B asks you to implement and reason about K-means rigorously, including a short proof and complexity analysis.

Constraints & Assumptions

  • Part A, LSTM shapes: assume a single-layer LSTM, input tensor with batch=32 , seq_len=100 , feat=64 , and hidden size H=128 . State the layout convention you use (batch-first vs time-first).
  • Part B: input is a dense numeric matrix X∈Rn×dX \in \mathbb{R}^{n \times d}X∈Rn×d ; you may use NumPy but not a clustering library (no sklearn.cluster ). Distances are squared Euclidean.
  • You should be able to discuss, but are not required to fully implement, kkk -selection and the mini-batch variant in code.

Clarifying Questions to Ask

  • For Part A, should the comparison assume a fixed task per modality (e.g. image classification vs. text classification), or stay architecture-general?
  • Is a Transformer in scope as a third point of comparison for text, or should the focus stay on CNN vs RNN/LSTM?
  • For Part B, do you want a numerically stable distance computation, or is the naive broadcast acceptable for the expected data size?
  • Should K-means handle replicates (multiple random restarts) and return the best run, or is a single run sufficient?
  • Any constraints on memory (e.g. can the full n×kn \times kn×k distance matrix be materialized)?

Part A: CNNs vs RNNs and LSTMs

Contrast CNNs and RNNs across two modalities — (i) 224×224224\times224224×224 RGB images and (ii) variable-length text — and then dig into LSTM internals.

For the modality comparison, address: inductive biases (translation equivariance, spatial/temporal locality, temporal order), parameter sharing, receptive-field growth with depth, and ability to model long-range dependencies. Then state precisely when a 1D CNN can replace an RNN/Transformer for sequences, including assumptions and caveats.

For LSTMs specifically: (a) write the gate equations and define every symbol and its shape; (b) show mathematically how the cell state mitigates vanishing gradients; (c) compute the output shapes for the given input under both unidirectional and bidirectional single-layer settings.

What This Part Should Cover

  • Correct mapping of each modality's structure to the matching inductive bias (2D locality/translation equivariance for images; order + variable length for text), and why serializing an image for an RNN is a poor fit.
  • A precise, correct condition for "1D CNN can replace an RNN/Transformer" (bounded/coverable context via receptive field or dilations; parallelism + stable gradients as the upside; causal padding and unbounded/non-local dependencies as the caveats).
  • Fully correct LSTM gate equations with symbols, shapes, and the cell-state / "constant error carousel" argument for gradient flow.
  • Correct output and final-state shapes for both unidirectional and bidirectional cases, with the layout convention stated explicitly.

Part B: K-means From Scratch

Implement K-means in NumPy with: k-means++ initialization, vectorized assignment and update steps, and convergence based on the decrease of the objective J=∑i∥xi−μci∥2J = \sum_i \lVert x_i - \mu_{c_i}\rVert^2J=∑i​∥xi​−μci​​∥2 (within-cluster sum of squared distances).

Then discuss: per-iteration time complexity O(n⋅k⋅d)O(n\cdot k\cdot d)O(n⋅k⋅d) and memory trade-offs; handling empty clusters; feature scaling and outliers; choosing kkk (silhouette, elbow, BIC); a proof that the objective is non-increasing per iteration; and a mini-batch variant with the conditions under which you would prefer it.

What This Part Should Cover

  • A correct, genuinely vectorized implementation: k-means++ init, (n,k)(n,k)(n,k) distance computation via the dot-product identity, argmin assignment, and a scatter-add / bincount mean update — no Python loop over the nnn points in the main steps.
  • Concrete handling of degenerate cases (empty clusters via reassign-to-farthest or split-largest; division-by-zero guard) and preprocessing (z-score scaling; outlier sensitivity of L2 and robust alternatives like k-medoids).
  • Correct complexity ( O(nkd)O(n k d)O(nkd) per iteration for assignment, O(nd)O(nd)O(nd) for update) and memory reasoning (materializing vs. chunking the (n,k)(n,k)(n,k) matrix; float32).
  • A valid two-step monotonicity proof, a sound discussion of kkk -selection criteria (and their limitations), and a mini-batch variant with a clear "when to use it" (very large nnn / streaming / limited compute) and its quality trade-off.

What a Strong Answer Covers

Across both parts, the interviewer is looking for an answer that connects mechanism to choice rather than reciting facts. Strong cross-cutting signals:

  • Reasoning from first principles — every architecture or algorithmic choice is justified by the data structure or objective, not by popularity. The candidate names the inductive bias / failure mode driving each recommendation.
  • Mathematical precision — LSTM equations, the ∂ct/∂ct−1\partial c_t/\partial c_{t-1}∂ct​/∂ct−1​ argument, tensor shapes, and the K-means monotonicity proof are all stated correctly and without hand-waving.
  • Engineering judgment — numerical stability (distance identity, float32), memory/throughput trade-offs, and degenerate-case handling appear unprompted , signaling production experience.
  • Honesty about limits — explicit caveats (CNN context limits vs. attention; LSTM still failing on very long horizons; K-means' local minima, sensitivity to scale/outliers, and the heuristic nature of kkk -selection).

Follow-up Questions

  • In Part A, how would a Transformer change your answer for the text modality, and what is its compute/memory cost relative to an LSTM and a dilated 1D CNN?
  • K-means assumes spherical, equal-variance clusters. How would you detect that this assumption is violated on a given dataset, and what would you switch to (e.g. GMM, DBSCAN, spectral clustering)?
  • Your monotonicity proof shows JJJ is non-increasing, but K-means can still land in a poor local optimum. What does k-means++ buy you here, and can you state its approximation guarantee?
  • For the mini-batch variant, how does the count-based learning rate η=1/countj\eta = 1/\text{count}_jη=1/countj​ affect convergence, and what changes if the data distribution drifts over time (streaming)?
Loading comments...

Browse More Questions

More Machine Learning•More Microsoft•More Data Scientist•Microsoft Data Scientist•Microsoft Machine Learning•Data Scientist Machine Learning

Write your answer

Your first approved answer each day earns 20 XP.

Sign in to write your answer.
PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • AI Coding Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.