This question evaluates predictive modeling and data-science competencies—exploratory data analysis, feature assessment, problem framing (regression vs classification vs ordinal), model selection and validation, evaluation metric choice, and post-model feature-importance interpretation—in the Machine Learning domain, combining conceptual understanding with practical application. It is commonly asked in technical interviews for Data Scientist roles because it tests end-to-end modeling judgment, reasoning about data distributions and variable usefulness, selection of validation and evaluation strategies, and awareness of pitfalls such as collinearity and data leakage.
You are given a clean CSV dataset about red wine. The target (dependent) variable is:
quality
(integer): wine quality score on a
1–7
scale.
There are ~10 input (independent) variables describing the wine’s chemical properties (all numeric), e.g.:
fixed_acidity
(float)
volatile_acidity
(float)
citric_acid
(float)
residual_sugar
(float)
chlorides
(float)
free_sulfur_dioxide
(float)
total_sulfur_dioxide
(float)
density
(float)
pH
(float)
sulphates
(float)
alcohol
(float)
Assume:
quality
, and why? Mention at least two different ways to assess this (e.g., correlation, mutual information, monotonic trends, domain reasoning).
quality
. You may choose any approach. Clearly specify:
Deliverable: a brief write-up of your approach and results; optionally include pseudocode / a code outline in Python (pandas + scikit-learn).
Login required