Interview prompt (ML project deep dive)
You are interviewing for a Data Scientist role. The interviewer asks you to pick one ML project you have personally built and explain it end-to-end, with emphasis on technical details and interpretability.
Answer the following:
- Project walkthrough
  - What is the problem statement and business goal?
  - What is the dataset (size, schema, label definition, time range), and what are the major data quality issues?
  - What model(s) did you try and why?
- Feature decisions
  - How did you decide which features to include/exclude?
  - How did you avoid leakage (especially time-based leakage)?
  - How did you validate that features are useful and stable over time?
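
As an illustration of the time-based leakage question, here is a minimal sketch of a point-in-time feature and a cutoff-based split. The table, the column names (`user_id`, `event_time`, `amount`), and the cutoff date are all hypothetical; the only point is that each row's feature uses strictly earlier rows, and that training data ends before test data begins.

```python
import pandas as pd

# Hypothetical event log; column names and values are made up for illustration.
events = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2],
        "event_time": pd.to_datetime(
            ["2024-01-01", "2024-01-05", "2024-02-01", "2024-01-03", "2024-02-10"]
        ),
        "amount": [10.0, 20.0, 30.0, 5.0, 15.0],
    }
).sort_values(["user_id", "event_time"])

# Mean of each user's PAST amounts only: shift(1) drops the current row
# before the expanding mean, so the feature never sees its own target period.
events["past_mean_amount"] = events.groupby("user_id")["amount"].transform(
    lambda s: s.shift(1).expanding().mean()
)

# Split on time rather than at random (assumed cutoff date).
cutoff = pd.Timestamp("2024-02-01")
train = events[events["event_time"] < cutoff]
test = events[events["event_time"] >= cutoff]

# Every training row precedes every test row.
assert train["event_time"].max() < test["event_time"].min()
```

A strong answer would typically also mention what the first row per user looks like (here the feature is missing, since there is no history) and how that is handled.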
- Hyperparameter tuning
  - What hyperparameters mattered most for your chosen model?
  - What tuning strategy did you use (grid/random/Bayesian/Hyperband), what metric did you optimize, and how did you prevent overfitting to the validation set?
  - How did you structure cross-validation (especially for time series / grouped users)?
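
For the cross-validation question, the two structures named there can be sketched with scikit-learn's `TimeSeriesSplit` and `GroupKFold`. The toy data below is hypothetical (rows assumed already sorted by time, `user_ids` assumed to be the grouping key); it only shows the property each splitter is meant to guarantee.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

# Hypothetical toy data: 12 rows already sorted by event time, 4 users.
n_rows = 12
X = np.arange(n_rows).reshape(-1, 1)
user_ids = np.repeat(np.arange(4), 3)  # assumed grouping key

# Time-ordered folds: every validation index comes after every training index.
time_folds = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, val_idx in time_folds:
    assert train_idx.max() < val_idx.min()

# Grouped folds: no user appears on both sides of a split.
group_folds = list(GroupKFold(n_splits=4).split(X, groups=user_ids))
for train_idx, val_idx in group_folds:
    assert set(user_ids[train_idx]).isdisjoint(user_ids[val_idx])
```

Either splitter can be passed as the `cv` argument of a scikit-learn search object, so the tuning metric is computed on folds that respect the same constraint.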
- PCA (Principal Component Analysis)
  - State the objective of PCA and write the key optimization problem.
  - Explain how PCA relates to the covariance matrix / SVD.
  - When is PCA appropriate vs. harmful for a supervised ML task?
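
One way to make the covariance/SVD question concrete: the first principal direction maximizes the projected variance w^T S w subject to ||w|| = 1, where S is the sample covariance of the centered data, and the same directions appear as the right singular vectors of the centered data matrix. The sketch below checks this numerically on synthetic data (the data and the mixing matrix are assumptions made for illustration).

```python
import numpy as np

# Synthetic data with three clearly different variances (assumed for illustration).
rng = np.random.default_rng(0)
mixing = np.array([[3.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing

# PCA works on centered data.
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Route 1: eigendecomposition of the sample covariance matrix.
cov = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Route 2: SVD of the centered data matrix.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_variances = s**2 / (n - 1)

# Explained variances agree, and directions agree up to sign.
assert np.allclose(eigvals, svd_variances)
assert np.allclose(np.abs(eigvecs.T @ Vt.T), np.eye(3), atol=1e-6)
```

Note that neither route looks at the label, which is the starting point for the "appropriate vs. harmful" discussion: a low-variance direction can still be the predictive one.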
- SHAP values
  - What are SHAP values conceptually? Provide the connection to Shapley values.
  - What properties make SHAP attractive (e.g., additivity/consistency)?
  - How do you interpret common SHAP plots (e.g., summary plot/beeswarm, dependence plot, force plot)?
  - List at least 3 pitfalls or failure modes when using SHAP.
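
To ground the Shapley connection and the additivity property, here is a brute-force sketch that computes exact Shapley values for a tiny hypothetical model, without the `shap` library. It assumes an interventional value function (features outside the coalition are filled in from a background sample), which is one common choice and not the only one; the model, the background data, and the helper names are all made up for illustration.

```python
from itertools import combinations
from math import factorial

import numpy as np


def model(X):
    # Hypothetical toy model: one additive term plus one interaction.
    return 2.0 * X[:, 0] + X[:, 1] * X[:, 2]


def coalition_value(x, background, subset):
    # Fix the features in `subset` to x's values, keep the rest from the
    # background sample, and average the model output.
    Z = background.copy()
    idx = np.array(subset, dtype=int)
    Z[:, idx] = x[idx]
    return model(Z).mean()


def shapley_values(x, background):
    # Exact Shapley values by enumerating all coalitions (exponential in d).
    d = x.shape[0]
    phi = np.zeros(d)
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for S in combinations(others, size):
                gain = coalition_value(x, background, S + (i,)) - coalition_value(
                    x, background, S
                )
                phi[i] += weight * gain
    return phi


rng = np.random.default_rng(0)
background = rng.normal(size=(50, 3))
x = np.array([1.0, 2.0, -1.0])

phi = shapley_values(x, background)
base_value = model(background).mean()

# Additivity: base value plus attributions reproduces the prediction.
assert np.isclose(base_value + phi.sum(), model(x[None, :])[0])
```

The enumeration over coalitions is what practical SHAP explainers approximate or exploit model structure to avoid, and the dependence on the chosen background sample is itself one of the pitfalls worth naming.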
Assume you must communicate both to (a) an ML-literate peer and (b) a non-technical stakeholder.