This question evaluates proficiency in logistic regression theory and regularization, GLM versus OLS assumptions, class imbalance handling and calibration, temporal validation to avoid leakage, correlated-feature penalty effects, and business-threshold decisioning for expected value, within the Machine Learning domain for a Data Scientist role.
Context: You are building a churn propensity model (y ∈ {0,1}) using logistic regression for a subscription business. Positives (churners) are 3% of samples. Answer each part precisely and concisely.
List the OLS assumptions. For each one that also matters for GLMs, explain how its violation (e.g., multicollinearity, omitted variables, measurement error, non‑IID observations) manifests in logistic regression and how regularization, feature engineering, or robust inference addresses it.
Define a temporal validation scheme that avoids leakage. Include: feature freeze date, out‑of‑time test window, and a k‑fold strategy compatible with time. Specify the exact splits on a 6‑month dataset.
With correlated features, contrast L1 vs. L2 on sparsity, stability, and interpretability. Propose a workflow that yields a sparse, stable model with confidence intervals for odds ratios.
Give one business‑aligned decision rule for choosing the score threshold using asymmetric costs, and show how to compute the expected value uplift over a "message all" policy.
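As a minimal sketch of one way the threshold part could be worked, assume the scores are calibrated churn probabilities and that each contacted customer has one of two outcomes: a net benefit when the customer is a true churner, or a net cost when they are not. Under those assumptions, messaging has positive expected value when p * benefit_tp - (1 - p) * cost_fp > 0, which gives the break-even threshold cost_fp / (benefit_tp + cost_fp). The function names, the parameters benefit_tp and cost_fp, and the dollar figures below are hypothetical and chosen only for illustration; in practice the labels and scores would come from the out-of-time test window.

```python
import numpy as np


def breakeven_threshold(benefit_tp: float, cost_fp: float) -> float:
    """Score above which messaging has positive expected value.

    Assumes calibrated probabilities p, so the rule
    p * benefit_tp - (1 - p) * cost_fp > 0 rearranges to
    p > cost_fp / (benefit_tp + cost_fp).
    """
    return cost_fp / (benefit_tp + cost_fp)


def policy_value(y, contact, benefit_tp: float, cost_fp: float) -> float:
    """Realized value of a contact policy on labeled hold-out data.

    Contacted churners (y == 1) contribute +benefit_tp, contacted
    non-churners contribute -cost_fp, uncontacted customers contribute 0.
    """
    y = np.asarray(y)
    contact = np.asarray(contact, dtype=bool)
    per_customer = np.where(y == 1, benefit_tp, -cost_fp)
    return float(per_customer[contact].sum())


def uplift_over_message_all(y, p, benefit_tp: float, cost_fp: float):
    """Return (threshold, EV of threshold policy, EV of message-all, uplift)."""
    y = np.asarray(y)
    p = np.asarray(p, dtype=float)
    t = breakeven_threshold(benefit_tp, cost_fp)
    ev_threshold = policy_value(y, p >= t, benefit_tp, cost_fp)
    ev_all = policy_value(y, np.ones(len(y), dtype=bool), benefit_tp, cost_fp)
    return t, ev_threshold, ev_all, ev_threshold - ev_all


# Toy hold-out set: 1 churner among 10 customers (illustrative values only).
y_holdout = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
p_holdout = [0.30, 0.10, 0.04, 0.03, 0.02, 0.02, 0.01, 0.01, 0.01, 0.01]

# Hypothetical economics: +40 net for messaging a true churner,
# -2 net for messaging a customer who would have stayed anyway.
t, ev_thr, ev_all, uplift = uplift_over_message_all(
    y_holdout, p_holdout, benefit_tp=40.0, cost_fp=2.0
)
print(f"threshold={t:.4f} ev_threshold={ev_thr} ev_all={ev_all} uplift={uplift}")
```

In this toy case the threshold is 2/42 (about 0.048), so only the two highest-scoring customers are messaged. The uplift over "message all" equals the contact costs avoided on low-score non-churners minus the benefit forgone on any low-score churners. If the scores are not calibrated, for example after reweighting or resampling for the 3% positive rate, the break-even formula no longer applies directly and the threshold would need to be chosen by maximizing policy_value on a validation window instead.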