Scenario
An interviewer is deep-diving into an ML project you built (you can assume it is a supervised model unless specified otherwise). They want you to justify model choices, evaluation, and training decisions.
Part A — Evaluation design and metrics
- Describe **how you evaluate** your model end-to-end (data split strategy, validation protocol, test usage).
- Which **metrics** do you use and **why** (business/ML tradeoffs)?
- Provide the **mathematical definitions** for the metrics you mention (e.g., accuracy, precision/recall, F1, ROC-AUC, PR-AUC, log loss, MSE/MAE, calibration metrics); a reference sketch follows this list.
- Propose a **better evaluation workflow** than a single holdout set (e.g., cross-validation, time-based split, stratification, repeated runs, confidence intervals); a cross-validation sketch follows this list.
- If you can add **human labels** (or human evaluation), explain:
  - what you would label,
  - how you would ensure quality (guidelines, inter-annotator agreement),
  - how it improves the evaluation signal.
- If you have **no labels**, what is the **simplest** way to estimate whether two model outputs/answers are **similar**? (A similarity sketch follows this list.)
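
For the definitions bullet, a reference sketch of the standard formulas, written in terms of confusion-matrix counts (TP, FP, TN, FN), targets \(y_i\), predictions \(\hat{y}_i\), predicted probabilities \(\hat{p}_i\) over \(n\) examples, and confidence bins \(S_1, \dots, S_B\) for calibration:

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\qquad
\mathrm{Precision} = \frac{TP}{TP + FP}
\qquad
\mathrm{Recall} = \frac{TP}{TP + FN}

F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

\mathrm{LogLoss} = -\frac{1}{n} \sum_{i=1}^{n} \bigl[\, y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \,\bigr]

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\qquad
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert

% ROC-AUC: area under the TPR-vs-FPR curve; equivalently the probability that a
% randomly chosen positive is scored above a randomly chosen negative.
% PR-AUC: area under the precision-vs-recall curve (more informative under class imbalance).

\mathrm{ECE} = \sum_{b=1}^{B} \frac{\lvert S_b \rvert}{n} \, \bigl\lvert \mathrm{acc}(S_b) - \mathrm{conf}(S_b) \bigr\rvert
```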
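
For the evaluation-workflow bullet, a minimal sketch of repeated stratified cross-validation with an approximate confidence interval, using scikit-learn; the synthetic data, logistic-regression model, and `average_precision` scoring are placeholder assumptions, not details from the scenario:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Placeholder data and model; substitute your own pipeline.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression(max_iter=1000)

# Repeated, stratified k-fold preserves class balance in every fold and averages
# away split-to-split noise instead of trusting a single holdout split.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(model, X, y, scoring="average_precision", cv=cv)

# Report the mean plus a rough 95% interval over the repeated folds
# (folds overlap, so treat the interval as approximate).
mean, sem = scores.mean(), scores.std(ddof=1) / np.sqrt(len(scores))
print(f"PR-AUC: {mean:.3f} +/- {1.96 * sem:.3f}")
```

For time-ordered data, the analogue is a time-based split (e.g., `TimeSeriesSplit`) so validation folds always come after the data the model was trained on.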
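
For the no-labels bullet, one simple baseline is to measure overlap between the two outputs directly; the sketch below assumes the outputs are free text and uses TF-IDF cosine similarity. When paraphrases share few tokens, cosine similarity over sentence embeddings is the natural next step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def output_similarity(answer_a: str, answer_b: str) -> float:
    """Crude label-free similarity: cosine between TF-IDF vectors of the two answers."""
    vectors = TfidfVectorizer().fit_transform([answer_a, answer_b])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

print(output_similarity("The model overfits on small datasets.",
                        "On small datasets the model tends to overfit."))
```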
Part B — Transformers vs RNNs on long inputs
- Compare **Transformer** and **RNN/LSTM/GRU** architectures.
- For **very long sequences**, discuss the pros/cons of each (training stability, ability to capture long-range dependencies, compute/memory).
- Explain why **attention** can capture long-range dependencies, and why vanilla RNNs often struggle. (A short attention sketch follows this list.)
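
To make the attention argument concrete: in self-attention every position scores every other position in a single step, so the gradient path between distant tokens has length one, whereas in a vanilla RNN the signal passes through one recurrence per intervening token and tends to vanish. A minimal PyTorch sketch of scaled dot-product attention (the toy shapes and the identity Q/K/V projections are simplifying assumptions):

```python
import torch
import torch.nn.functional as F

# Toy batch: one sequence of length 4096 with model dimension 64.
seq_len, d_model = 4096, 64
x = torch.randn(1, seq_len, d_model)

# Real self-attention projects x into queries/keys/values with learned
# linear layers; identity projections keep the sketch short.
q, k, v = x, x, x

# Every position attends to every other position directly (one hop), so
# long-range interactions do not have to survive seq_len recurrent steps.
# The cost is the (seq_len x seq_len) score matrix: quadratic memory/compute.
scores = q @ k.transpose(-2, -1) / d_model ** 0.5   # (1, seq_len, seq_len)
weights = F.softmax(scores, dim=-1)
out = weights @ v                                   # (1, seq_len, d_model)
print(scores.shape, out.shape)
```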
Part C — Detecting distribution mismatch in images
You have two sets of images (Set A and Set B). How would you test whether they appear to come from the same underlying distribution?
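
One practical answer here is a classifier two-sample test: label Set A as 0 and Set B as 1, train a classifier on image features, and check whether held-out accuracy is meaningfully above chance (near 0.5 means the sets are hard to tell apart). The sketch below assumes fixed-length feature vectors have already been extracted (e.g., from a pretrained CNN); the random features and the random-forest choice are placeholders. Kernel MMD on the same embeddings, or FID, are common alternatives.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def classifier_two_sample_test(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Cross-validated accuracy of an A-vs-B classifier; ~0.5 suggests no detectable mismatch."""
    X = np.vstack([feats_a, feats_b])
    y = np.concatenate([np.zeros(len(feats_a)), np.ones(len(feats_b))])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    return cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()

# Placeholder features standing in for per-image embeddings.
feats_a = np.random.rand(500, 128)
feats_b = np.random.rand(500, 128)
print(classifier_two_sample_test(feats_a, feats_b))
```

To turn that accuracy into a decision, compare it against chance with a permutation or binomial test rather than eyeballing the number.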
Part D — Optimizers
Compare the practical differences and tradeoffs among SGD (with/without momentum), RMSProp, Adam, and AdamW. When would AdamW be preferable?
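
The difference most worth demonstrating is decoupled weight decay: `Adam(weight_decay=...)` folds L2 regularization into the adaptive update, so the effective decay is rescaled by each parameter's second-moment estimate, while AdamW shrinks the weights directly at every step. A minimal PyTorch sketch (the linear model and hyperparameters are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 2)  # placeholder model

# SGD with momentum: no per-parameter adaptive scaling; often generalizes well
# but typically needs more learning-rate tuning.
sgd = torch.optim.SGD(model.parameters(), lr=1e-1, momentum=0.9)

# RMSProp: scales gradients by a running estimate of their second moment.
rmsprop = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.99)

# Adam: adds a bias-corrected first-moment (momentum) term on top of the
# second-moment scaling. With weight_decay, the L2 term is added to the
# gradient and then rescaled by the adaptive denominator, so decay strength
# varies across parameters.
adam = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=1e-2)

# AdamW: same adaptive update, but weight decay is applied directly to the
# weights (decoupled), which is why it is the usual default for transformers.
adamw = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)
```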