Build a model to predict wine quality
Company: EvenUp
Role: Data Scientist
Category: Machine Learning
Difficulty: Medium
Interview Round: Technical Screen
## Modeling task: Predict wine quality from a CSV
You are given a clean CSV dataset about red wine. The target (dependent) variable is:
- `quality` (integer): wine quality score on a **1–7** scale.
There are 11 input (independent) variables describing the wine’s chemical properties (all numeric):
- `fixed_acidity` (float)
- `volatile_acidity` (float)
- `citric_acid` (float)
- `residual_sugar` (float)
- `chlorides` (float)
- `free_sulfur_dioxide` (float)
- `total_sulfur_dioxide` (float)
- `density` (float)
- `pH` (float)
- `sulphates` (float)
- `alcohol` (float)
Assume:
- There are **no missing values**.
- Each row is one wine sample; samples are i.i.d. (unless you discover evidence otherwise).
### Questions
1. **EDA:** What do you learn from exploring the dataset (distributions, outliers, correlations, target imbalance, non-linearities)? List at least 3 concrete findings and how they affect modeling choices.
2. **Feature usefulness (pre-model):** Which variables appear likely to be useful for predicting `quality`, and why? Mention at least two different ways to assess this (e.g., correlation, mutual information, monotonic trends, domain reasoning).
3. **Modeling:** Build a model to predict `quality`. You may choose any approach. Clearly specify:
- whether you treat the task as **regression**, **classification**, or **ordinal classification**, and why
- train/validation strategy (e.g., split or cross-validation)
- evaluation metric(s)
4. **Feature importance (post-model):** How would you determine which variables are actually useful in your final model? Provide a method appropriate to your model choice and explain pitfalls (e.g., collinearity, leakage, bias in impurity-based importances).
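As a sketch of questions 1–2 (EDA and pre-model feature assessment), the snippet below builds a small synthetic stand-in for the wine table (only a subset of the listed columns, with a made-up dependence of `quality` on `alcohol` and `volatile_acidity`); in the interview you would replace the synthetic block with `pd.read_csv(...)` on the real file. It illustrates three checks: target imbalance via value counts, linear association via correlation, and non-linear association via mutual information.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Synthetic stand-in for the real CSV (hypothetical values; swap in
# pd.read_csv("winequality.csv") in practice). Only 3 of the 11 columns
# are mocked here, for brevity.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.0, n),
    "volatile_acidity": rng.normal(0.5, 0.15, n),
    "sulphates": rng.normal(0.65, 0.1, n),
})
# Fabricated target: quality rises with alcohol, falls with volatile
# acidity, clipped/rounded onto the 1-7 integer scale.
df["quality"] = np.clip(
    np.round(4 + 0.8 * (df["alcohol"] - 10.5)
               - 3.0 * (df["volatile_acidity"] - 0.5)
               + rng.normal(0, 0.5, n)),
    1, 7,
).astype(int)

# Check 1: target imbalance (mid-scale scores usually dominate).
print(df["quality"].value_counts().sort_index())

# Check 2: linear association of each feature with the target.
print(df.corr(numeric_only=True)["quality"].sort_values())

# Check 3: non-linear association via mutual information.
X = df.drop(columns="quality")
mi = mutual_info_regression(X, df["quality"], random_state=0)
print(pd.Series(mi, index=X.columns).sort_values(ascending=False))
```

A heavily imbalanced target argues for stratified splits and against plain accuracy; features that score high on mutual information but low on correlation hint at non-linear effects, which favors tree-based models.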
Deliverable: a brief write-up of your approach and results; optionally include pseudocode / a code outline in Python (pandas + scikit-learn).
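One possible code outline for questions 3–4, again on synthetic stand-in data (the column names come from the prompt; the data-generating process is invented purely so the sketch runs end to end). It frames the task as regression on the ordinal target, validates with 5-fold cross-validation and MAE, and uses permutation importance on a held-out split rather than impurity-based importance.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the wine CSV (hypothetical; replace with
# pd.read_csv on the real file and use all 11 features).
rng = np.random.default_rng(0)
n = 600
X = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.0, n),
    "volatile_acidity": rng.normal(0.5, 0.15, n),
    "sulphates": rng.normal(0.65, 0.1, n),
})
y = np.clip(
    np.round(4 + 0.8 * (X["alcohol"] - 10.5)
               - 3.0 * (X["volatile_acidity"] - 0.5)
               + rng.normal(0, 0.5, n)),
    1, 7,
).astype(int)

# Framing: regression on the ordinal 1-7 target (predictions can be
# rounded back onto the integer grid); MAE respects the ordering.
model = RandomForestRegressor(n_estimators=200, random_state=0)
mae = -cross_val_score(model, X, y, cv=5,
                       scoring="neg_mean_absolute_error")
print("5-fold CV MAE:", mae.mean())

# Post-model importance: permutation importance on held-out data avoids
# the cardinality bias of impurity importances, though strongly
# correlated features still split credit between themselves.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model.fit(X_tr, y_tr)
imp = permutation_importance(model, X_te, y_te,
                             n_repeats=10, random_state=0)
print(pd.Series(imp.importances_mean, index=X.columns)
        .sort_values(ascending=False))
```

Ordinal classification (e.g., cumulative-link models or thresholded regression) is a defensible alternative framing; the key is to justify the choice and pick a metric, such as MAE or quadratic-weighted kappa, that penalizes predictions by how far they miss on the ordinal scale.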
Quick Answer: This question evaluates end-to-end predictive-modeling skill in the Machine Learning domain: exploratory data analysis, pre-model feature assessment, problem framing (regression vs. classification vs. ordinal), model selection and validation, evaluation-metric choice, and post-model feature-importance interpretation. It is commonly asked in technical screens for Data Scientist roles because it tests modeling judgment, reasoning about data distributions and variable usefulness, and awareness of pitfalls such as collinearity, target imbalance, and data leakage.