Explain Feature, Model, and Validation Choices
Company: TransUnion
Role: Data Scientist
Category: Machine Learning
Difficulty: medium
Interview Round: Technical Screen
You are interviewing for a Data Scientist role. Describe how you would approach an end-to-end machine learning project on large-scale data.
In your answer, cover all of the following:
- the standard data processing workflow from raw data extraction to modeling and deployment-ready outputs;
- how you create and select features, including how you think about domain knowledge, missing data, feature leakage, multicollinearity, temporal stability, and feature importance;
- which machine learning models you have used and how you decide which model to try first;
- why you might choose Random Forest instead of XGBoost, and when XGBoost would likely be the better choice;
- the differences between Logistic Regression and Random Forest in terms of assumptions, interpretability, nonlinearity, feature engineering needs, training behavior, and calibration;
- how you validate model results, including train/validation/test strategy, cross-validation, time-based splits when relevant, metric selection for class imbalance, and how you check whether the model will generalize.
Use a concrete project example if possible, and explain not only what you did, but why you made those choices.
Quick Answer: This question evaluates a candidate's competency in end-to-end machine learning project execution, including data processing workflows, feature creation and selection, model choice (e.g., linear versus tree-based models), and validation strategies.
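One way to make the feature-selection points concrete: a minimal sketch, assuming scikit-learn and synthetic data, of checking for multicollinearity and ranking features by permutation importance (which, unlike impurity-based importance, is computed on held-out data and is less biased toward high-cardinality features). All dataset parameters and thresholds here are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with redundant features to mimic multicollinearity.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=4, n_redundant=3, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Flag highly correlated feature pairs; collinearity splits importance
# across near-duplicates and destabilizes linear-model coefficients.
corr = np.corrcoef(X_tr, rowvar=False)
high_corr = [
    (i, j)
    for i in range(corr.shape[0])
    for j in range(i + 1, corr.shape[1])
    if abs(corr[i, j]) > 0.9
]

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Permutation importance: shuffle one feature at a time on validation data
# and measure the drop in score.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print("Correlated pairs (|r| > 0.9):", high_corr)
print("Top features by permutation importance:", ranking[:5])
```

In practice the correlation screen would run on the full candidate feature set before modeling, and a VIF check or domain review would decide which member of each correlated pair to keep.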
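For the Logistic Regression versus Random Forest comparison, a small illustrative sketch (scikit-learn, synthetic data) contrasts their feature-engineering needs and probability calibration: logistic regression assumes linear log-odds and benefits from scaling but tends to produce well-calibrated probabilities, while a random forest captures nonlinearity without scaling but its averaged votes can be less well calibrated. The Brier score (mean squared error of predicted probabilities, lower is better) makes the difference measurable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=4000, n_features=15, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

# Logistic regression: linear decision boundary in log-odds; scale features
# so regularization treats them comparably.
scaler = StandardScaler().fit(X_tr)
lr = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)
lr_probs = lr.predict_proba(scaler.transform(X_te))[:, 1]

# Random forest: handles nonlinearity and interactions with no scaling needed.
rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_tr, y_tr)
rf_probs = rf.predict_proba(X_te)[:, 1]

# Brier score compares calibration of the two probability estimates.
print(f"LogReg Brier: {brier_score_loss(y_te, lr_probs):.3f}")
print(f"RF     Brier: {brier_score_loss(y_te, rf_probs):.3f}")
```

If the tree model wins on ranking metrics but loses on calibration, a post-hoc step such as isotonic or Platt calibration on a held-out set is a common fix.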
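The validation bullet can likewise be sketched in code: time-based cross-validation (train on the past, validate on the future) with PR-AUC as the headline metric under class imbalance. This uses scikit-learn's `TimeSeriesSplit`; the synthetic data stands in for records assumed sorted by event time, oldest first.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import TimeSeriesSplit

# Imbalanced synthetic data (~5% positives) standing in for time-ordered rows.
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.95, 0.05], random_state=0
)

# Each fold trains only on earlier rows and validates on later ones,
# which avoids leaking future information into training.
tscv = TimeSeriesSplit(n_splits=5)
pr_aucs, roc_aucs = [], []
for train_idx, val_idx in tscv.split(X):
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    scores = model.predict_proba(X[val_idx])[:, 1]
    # With heavy imbalance, PR-AUC (average precision) is more informative
    # than accuracy; ROC-AUC is reported alongside for comparison.
    pr_aucs.append(average_precision_score(y[val_idx], scores))
    roc_aucs.append(roc_auc_score(y[val_idx], scores))

print(f"PR-AUC:  {np.mean(pr_aucs):.3f} +/- {np.std(pr_aucs):.3f}")
print(f"ROC-AUC: {np.mean(roc_aucs):.3f} +/- {np.std(roc_aucs):.3f}")
```

A large gap between early and late folds is itself a signal worth raising in the interview answer: it suggests temporal drift and a model that may not generalize once deployed.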