Handle missing and unavailable predictive features
Company: Intuit
Role: Product Analyst
Category: Machine Learning
Difficulty: easy
Interview Round: Onsite
## Scenario
You are building a model to predict whether a user will **successfully file taxes** (binary label `success`) for a TurboTax-like product.
One of the most predictive features is:
- `session_count` = **cumulative number of sessions** a user has had in the product.
However:
- In the training dataset, `session_count` has many values that are **0** and many that are **missing**.
- In production, stakeholders claim that `session_count` is **not available at scoring time** (i.e., when you need to make the prediction), even though it appears in the schema.
- Exploratory analysis shows `session_count` is **negatively correlated** with `success`.
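A first look at the `0` versus missing distinction might be sketched as below. This is a minimal illustration, not a prescribed method: it assumes missing values arrive as `None`, and the helper names (`missingness_bucket`, `success_rate_by_bucket`) and the toy rows are hypothetical. The idea is to keep "missing", "zero", and "positive" as separate groups and compare their success rates before deciding on any imputation.

```python
from collections import defaultdict


def missingness_bucket(session_count):
    """Label a raw session_count as 'missing', 'zero', or 'positive'."""
    if session_count is None:
        return "missing"
    return "zero" if session_count == 0 else "positive"


def success_rate_by_bucket(rows):
    """rows: iterable of (session_count_or_None, success) pairs, success in {0, 1}."""
    totals = defaultdict(lambda: [0, 0])  # bucket -> [successes, row count]
    for session_count, success in rows:
        bucket = missingness_bucket(session_count)
        totals[bucket][0] += success
        totals[bucket][1] += 1
    return {bucket: s / n for bucket, (s, n) in totals.items()}


# Hypothetical toy rows, for illustration only (not real product data).
rows = [(None, 1), (None, 1), (0, 0), (0, 1), (3, 1), (12, 0), (7, 0), (None, 0)]
rates = success_rate_by_bucket(rows)
print(rates)
```

If the "missing" and "zero" groups behave differently, that suggests they come from different mechanisms (for example, a logging gap versus a genuinely new user) and should not be merged by filling missing values with `0`.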
## Questions
1. **Data quality / missingness:** How would you investigate why `session_count` is often `0` or missing, and how would you treat these cases during modeling?
2. **Training-serving skew:** If `session_count` is not available at inference time, what are your options? How do you decide whether to (a) drop it, (b) engineer a proxy, or (c) change the prediction timing / problem definition?
3. **Interpretation:** Provide at least two plausible explanations for the negative correlation between `session_count` and `success` (including an “opposite viewpoint”), and describe what additional data or analyses you would use to validate/refute each explanation.
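For the interpretation question, one analysis worth considering is a stratified comparison: compute the correlation overall and again within segments that might confound it. The sketch below assumes a hypothetical `segment` field (for example, simple versus complex tax situations) and uses made-up toy records built to show how an overall negative correlation can coexist with positive correlations inside each segment.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


# Hypothetical toy records: (segment, session_count, success).
records = [
    ("simple", 2, 1), ("simple", 2, 1), ("simple", 2, 1),
    ("simple", 2, 0), ("simple", 4, 1), ("simple", 4, 1),
    ("complex", 10, 0), ("complex", 10, 0), ("complex", 10, 1),
    ("complex", 14, 1), ("complex", 14, 0), ("complex", 14, 1),
]

# Correlation across all users, ignoring segment.
overall = pearson([r[1] for r in records], [r[2] for r in records])

# Correlation within each segment.
by_segment = {}
for segment, sessions, success in records:
    xs, ys = by_segment.setdefault(segment, ([], []))
    xs.append(sessions)
    ys.append(success)
within = {seg: pearson(xs, ys) for seg, (xs, ys) in by_segment.items()}

print("overall:", overall)
print("within segments:", within)
```

In this toy data the overall correlation is negative while both within-segment correlations are positive, which is the pattern you would expect if return complexity drives both more sessions and lower success. The opposite viewpoint, that many sessions reflect genuine struggle, would predict a negative relationship that persists within segments.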
## Constraints / expectations
- Assume you have standard product event logs available in principle (page views, step completions, timestamps), but instrumentation may be imperfect.
- Your answer should cover: leakage risk, feature availability, and how you would communicate tradeoffs to stakeholders.
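The leakage and availability concerns can be made concrete with a point-in-time version of the feature. The following is a sketch under stated assumptions: the event log is modeled as a hypothetical dict from user ID to a list of session start timestamps, and `session_count_as_of` is an illustrative helper, not an existing API. It counts only sessions that began before the prediction time, and returns `None` rather than `0` when a user has no log entry at all.

```python
from datetime import datetime


def session_count_as_of(event_log, user_id, prediction_time):
    """Count sessions that started strictly before prediction_time.

    Returns None when the user is absent from the log, so that
    "unknown" is not silently recorded as zero sessions.
    """
    starts = event_log.get(user_id)
    if starts is None:
        return None
    return sum(1 for ts in starts if ts < prediction_time)


# Hypothetical toy log: user_id -> session start timestamps.
event_log = {
    "u1": [
        datetime(2024, 1, 20),
        datetime(2024, 2, 3),
        datetime(2024, 3, 1),
        datetime(2024, 4, 10),
    ],
    "u2": [],  # user is tracked but has no sessions yet
}

prediction_time = datetime(2024, 2, 15)

as_of_count = session_count_as_of(event_log, "u1", prediction_time)
end_of_season_count = len(event_log["u1"])  # includes sessions after scoring

print(as_of_count, end_of_season_count)
```

Training on the end-of-season total while scoring with an as-of count (or with nothing) is one way training-serving skew and leakage can enter; rebuilding the training feature with the same as-of logic used at scoring time keeps the two consistent.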
Quick Answer: This question tests whether a product analyst can diagnose data quality issues, handle feature availability and training-serving skew, and interpret counterintuitive correlations. It probes both conceptual understanding and practical judgment in model development and deployment.