Deep Learning Concepts and K-means Implementation (Onsite ML Interview)
This is a two-part onsite round for a Data Scientist role: a conceptual deep-learning comparison (Part A) followed by a from-scratch coding exercise (Part B). Part A probes whether you understand why different architectures fit different modalities — not just definitions — and asks you to be precise about LSTM math and tensor shapes. Part B asks you to implement and reason about K-means rigorously, including a short proof and complexity analysis.
Constraints & Assumptions
-
Part A, LSTM shapes:
assume a single-layer LSTM, input tensor with
batch=32
,
seq_len=100
,
feat=64
, and hidden size
H=128
. State the layout convention you use (batch-first vs time-first).
-
Part B:
input is a dense numeric matrix
X∈Rn×d
; you may use NumPy but
not
a clustering library (no
sklearn.cluster
). Distances are squared Euclidean.
-
You should be able to discuss, but are not required to fully implement,
k
-selection and the mini-batch variant in code.
Clarifying Questions to Ask
-
For Part A, should the comparison assume a fixed task per modality (e.g. image classification vs. text classification), or stay architecture-general?
-
Is a Transformer in scope as a third point of comparison for text, or should the focus stay on CNN vs RNN/LSTM?
-
For Part B, do you want a numerically stable distance computation, or is the naive broadcast acceptable for the expected data size?
-
Should K-means handle replicates (multiple random restarts) and return the best run, or is a single run sufficient?
-
Any constraints on memory (e.g. can the full
n×k
distance matrix be materialized)?
Part A: CNNs vs RNNs and LSTMs
Contrast CNNs and RNNs across two modalities — (i) 224×224 RGB images and (ii) variable-length text — and then dig into LSTM internals.
For the modality comparison, address: inductive biases (translation equivariance, spatial/temporal locality, temporal order), parameter sharing, receptive-field growth with depth, and ability to model long-range dependencies. Then state precisely when a 1D CNN can replace an RNN/Transformer for sequences, including assumptions and caveats.
For LSTMs specifically: (a) write the gate equations and define every symbol and its shape; (b) show mathematically how the cell state mitigates vanishing gradients; (c) compute the output shapes for the given input under both unidirectional and bidirectional single-layer settings.
What This Part Should Cover
-
Correct mapping of each modality's structure to the matching inductive bias (2D locality/translation equivariance for images; order + variable length for text), and
why
serializing an image for an RNN is a poor fit.
-
A precise, correct condition for "1D CNN can replace an RNN/Transformer" (bounded/coverable context via receptive field or dilations; parallelism + stable gradients as the upside; causal padding and unbounded/non-local dependencies as the caveats).
-
Fully correct LSTM gate equations with symbols, shapes, and the cell-state / "constant error carousel" argument for gradient flow.
-
Correct output and final-state shapes for both unidirectional and bidirectional cases, with the layout convention stated explicitly.
Part B: K-means From Scratch
Implement K-means in NumPy with: k-means++ initialization, vectorized assignment and update steps, and convergence based on the decrease of the objective J=∑i∥xi−μci∥2 (within-cluster sum of squared distances).
Then discuss: per-iteration time complexity O(n⋅k⋅d) and memory trade-offs; handling empty clusters; feature scaling and outliers; choosing k (silhouette, elbow, BIC); a proof that the objective is non-increasing per iteration; and a mini-batch variant with the conditions under which you would prefer it.
What This Part Should Cover
-
A correct, genuinely vectorized implementation: k-means++ init,
(n,k)
distance computation via the dot-product identity,
argmin
assignment, and a scatter-add /
bincount
mean update — no Python loop over the
n
points in the main steps.
-
Concrete handling of degenerate cases (empty clusters via reassign-to-farthest or split-largest; division-by-zero guard) and preprocessing (z-score scaling; outlier sensitivity of L2 and robust alternatives like k-medoids).
-
Correct complexity (
O(nkd)
per iteration for assignment,
O(nd)
for update) and memory reasoning (materializing vs. chunking the
(n,k)
matrix; float32).
-
A valid two-step monotonicity proof, a sound discussion of
k
-selection criteria (and their limitations), and a mini-batch variant with a clear "when to use it" (very large
n
/ streaming / limited compute) and its quality trade-off.
What a Strong Answer Covers
Across both parts, the interviewer is looking for an answer that connects mechanism to choice rather than reciting facts. Strong cross-cutting signals:
-
Reasoning from first principles
— every architecture or algorithmic choice is justified by the data structure or objective, not by popularity. The candidate names the inductive bias / failure mode driving each recommendation.
-
Mathematical precision
— LSTM equations, the
∂ct/∂ct−1
argument, tensor shapes, and the K-means monotonicity proof are all stated correctly and without hand-waving.
-
Engineering judgment
— numerical stability (distance identity, float32), memory/throughput trade-offs, and degenerate-case handling appear
unprompted
, signaling production experience.
-
Honesty about limits
— explicit caveats (CNN context limits vs. attention; LSTM still failing on very long horizons; K-means' local minima, sensitivity to scale/outliers, and the heuristic nature of
k
-selection).
Follow-up Questions
-
In Part A, how would a Transformer change your answer for the text modality, and what is its compute/memory cost relative to an LSTM and a dilated 1D CNN?
-
K-means assumes spherical, equal-variance clusters. How would you detect that this assumption is violated on a given dataset, and what would you switch to (e.g. GMM, DBSCAN, spectral clustering)?
-
Your monotonicity proof shows
J
is non-increasing, but K-means can still land in a poor local optimum. What does k-means++ buy you here, and can you state its approximation guarantee?
-
For the mini-batch variant, how does the count-based learning rate
η=1/countj
affect convergence, and what changes if the data distribution drifts over time (streaming)?