Explain project details, PCA, and SHAP
Company: Point72
Role: Data Scientist
Category: Machine Learning
Difficulty: easy
Interview Round: Technical Screen
### Interview prompt (ML project deep dive)
You are interviewing for a **Data Scientist** role. The interviewer asks you to pick **one ML project you have personally built** and explain it end-to-end, with emphasis on technical details and interpretability.
Answer the following:
1. **Project walkthrough**
   - What is the problem statement and business goal?
   - What is the dataset (size, schema, label definition, time range), and what are the major data quality issues?
   - What model(s) did you try and why?
2. **Feature decisions**
   - How did you decide which features to include/exclude?
   - How did you avoid leakage (especially time-based leakage)?
   - How did you validate that features are useful and stable over time?
3. **Hyperparameter tuning**
   - What hyperparameters mattered most for your chosen model?
   - What tuning strategy did you use (grid/random/Bayesian/Hyperband), what metric did you optimize, and how did you prevent overfitting to the validation set?
   - How did you structure cross-validation (especially for time series / grouped users)?
4. **PCA (Principal Component Analysis)**
   - State the objective of PCA and write the key optimization problem.
   - Explain how PCA relates to the covariance matrix / SVD.
   - When is PCA appropriate vs. harmful for a supervised ML task?
5. **SHAP values**
   - What are SHAP values conceptually? Provide the connection to **Shapley values**.
   - What properties make SHAP attractive (e.g., additivity/consistency)?
   - How do you interpret common SHAP plots (e.g., summary plot/beeswarm, dependence plot, force plot)?
   - List at least 3 pitfalls or failure modes when using SHAP.
Assume you must communicate both to (a) an ML-literate peer and (b) a non-technical stakeholder.
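For item 4, the standard formulation can be written compactly as below. This is a reference sketch rather than a model answer, and the notation is an assumption introduced here: $X$ is the $n \times d$ column-centered data matrix, $S$ its sample covariance, $W$ a $d \times k$ matrix of orthonormal directions, and $X = U \Sigma V^\top$ the thin SVD with singular values $\sigma_i$.

```latex
% Variance-maximization form (X assumed column-centered, k <= d)
S = \frac{1}{n-1} X^\top X,
\qquad
\max_{W \in \mathbb{R}^{d \times k}} \operatorname{tr}\!\left(W^\top S W\right)
\quad \text{s.t.} \quad W^\top W = I_k

% Equivalent reconstruction-error form
\min_{W \in \mathbb{R}^{d \times k},\; W^\top W = I_k}
\left\lVert X - X W W^\top \right\rVert_F^2

% Link to the SVD: the principal directions are the columns of V
S = V \, \frac{\Sigma^2}{n-1} \, V^\top,
\qquad
\lambda_i = \frac{\sigma_i^2}{n-1},
\qquad
\text{explained variance ratio}_i = \frac{\lambda_i}{\sum_{j} \lambda_j}
```

The optimum is attained by the top $k$ eigenvectors of $S$, which are the first $k$ columns of $V$. Because the objective never sees the label, a high-variance direction is not necessarily a predictive one, which is the usual starting point for the "appropriate vs. harmful" part of the question.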
Quick Answer: This question evaluates a data scientist's command of end-to-end machine learning project development: dataset characterization, feature engineering and leakage control, model selection and hyperparameter tuning, dimensionality reduction via PCA, and interpretability using SHAP.
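To make the SHAP-to-Shapley connection in item 5 concrete, here is a minimal sketch that computes exact Shapley values by enumerating every coalition and then checks the additivity property. Everything in it is hypothetical and chosen for illustration: the toy `model`, the `background` rows, and the instance `x_explain`. The value function uses a marginal (interventional) expectation over the background rows, which is one common convention and not the only one. It does not use the `shap` library, and the enumeration is exponential in the number of features, so it only suits toy sizes.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values via full coalition enumeration (O(2^n))."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Shapley weight: |S|! * (n - |S| - 1)! / n!
            weight = (
                factorial(size)
                * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i to coalition s
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


# Hypothetical toy model with one interaction term (illustration only).
def model(x):
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[0] * x[2]


# Hypothetical background data and the instance to explain.
background = [
    (0.0, 0.0, 0.0),
    (1.0, 2.0, 1.0),
    (2.0, 1.0, 3.0),
    (1.0, 1.0, 0.0),
]
x_explain = (3.0, 1.0, 2.0)


def value_fn(coalition):
    """Mean prediction with coalition features fixed to x_explain and
    the remaining features drawn from each background row."""
    total = 0.0
    for row in background:
        mixed = tuple(
            x_explain[j] if j in coalition else row[j]
            for j in range(len(x_explain))
        )
        total += model(mixed)
    return total / len(background)


phi = shapley_values(value_fn, n_features=3)
base_value = value_fn(frozenset())  # average prediction over the background

print("base value:", base_value)
print("attributions:", phi)
print("base + sum(phi):", base_value + sum(phi))
print("model(x_explain):", model(x_explain))
```

The last two printed numbers should agree up to floating-point error: the attributions sum to the gap between the instance's prediction and the baseline, which is the additivity (local accuracy) property. Changing `background` changes both the baseline and the attributions, which illustrates one frequently cited SHAP pitfall: explanations depend on the chosen reference data.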