Identify Risks and Improve Imputation Class Implementations
Scenario
You are reviewing three custom Python imputation classes intended for use in a scikit-learn workflow. Each class fills missing values column-wise using one of the following strategies: mean, median, or mode.
Assume these classes are meant to be sklearn-compatible transformers used within pipelines (fit on train, transform on validation/test) and may be applied to numpy arrays, pandas DataFrames, or sparse matrices.
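For orientation, here is a minimal sketch of what "sklearn-compatible" could look like for the mean strategy. It is not one of the classes under review. The class name `MeanImputerSketch` and the `fill_value` fallback are hypothetical, and the sketch assumes dense, numeric, 2-D input with NaN as the missing marker.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanImputerSketch(TransformerMixin, BaseEstimator):
    """Hypothetical column-wise mean imputer for dense numeric arrays."""

    def __init__(self, fill_value=0.0):
        # Assumed fallback for columns that have no observed values at fit time.
        self.fill_value = fill_value

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        observed = ~np.isnan(X)
        counts = observed.sum(axis=0)
        sums = np.where(observed, X, 0.0).sum(axis=0)
        # Guard the division so all-missing columns fall back to fill_value.
        means = sums / np.maximum(counts, 1)
        self.statistics_ = np.where(counts > 0, means, self.fill_value)
        self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X):
        # np.array copies, so the caller's data is not modified in place.
        X = np.array(X, dtype=float)
        if X.shape[1] != self.n_features_in_:
            raise ValueError("X has a different number of columns than at fit time")
        mask = np.isnan(X)
        # Fill each missing cell with the statistic learned for its column.
        X[mask] = np.take(self.statistics_, np.nonzero(mask)[1])
        return X


# Statistics are learned on the training split only, then reused on the test split.
X_train = np.array([[1.0, np.nan], [3.0, np.nan]])
X_test = np.array([[np.nan, 5.0], [np.nan, np.nan]])
imputer = MeanImputerSketch().fit(X_train)
X_filled = imputer.transform(X_test)
```

Storing the learned values in a trailing-underscore attribute and leaving `__init__` free of logic keeps the class usable with `Pipeline` and `clone`. Sparse input, pandas dtypes, and incremental fitting are deliberately left out of this sketch.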
Task
- Identify potential problems or risks in these mean/median/mode imputer implementations.
- Propose concrete improvements or refactors to make them robust, reusable, and compliant with the sklearn interface.
Hints
Consider: inheritance and API compliance, dtype handling (numeric, boolean, categorical, datetime), sparse data, incremental/streaming fit, edge cases (all-missing columns, ties for mode), performance, and testability.
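Two of the edge cases above, ties for the mode and all-missing columns, can be made concrete with a small standard-library sketch. The helper name `column_mode`, its `fill_value` parameter, and the "smallest value wins" tie rule are assumptions chosen for illustration. The sketch also assumes the observed values within a column are mutually comparable.

```python
import math
from collections import Counter


def column_mode(values, fill_value=None):
    """Hypothetical helper: most frequent observed value, with explicit edge-case rules."""
    # Treat both None and float NaN as missing.
    observed = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    if not observed:
        # All-missing column: return the caller's fallback instead of failing.
        return fill_value
    counts = Counter(observed)
    top = max(counts.values())
    # Deterministic tie-break: the smallest of the most frequent values.
    return min(v for v, c in counts.items() if c == top)


tie_result = column_mode(["b", "a", "a", "b", None])
empty_result = column_mode([float("nan"), None], fill_value=-1)
numeric_result = column_mode([2, 1, 2, 1, 3])
```

Without an explicit tie rule, the fitted mode can depend on row order, so two fits on the same data in a different order may impute different values.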
Constraints & Assumptions
- Preserve the scope, facts, inputs, and requested outputs from the prompt above.
- If the prompt leaves a detail unspecified, state a reasonable assumption before relying on it.
- Keep the answer interview-ready: concise enough to present, but concrete enough to implement or evaluate.
Clarifying Questions to Ask
- Clarify the task, data shape, labels, constraints, and evaluation metric.
- State assumptions behind the math or modeling technique you choose.
- Connect theory to practical training, debugging, and deployment implications.
What a Strong Answer Covers
- Correct definitions and formulas where the prompt requires them.
- A practical explanation of how the method behaves on real data.
- Trade-offs, failure modes, diagnostics, and mitigation strategies.
- Evaluation choices that match the product or modeling objective.
Follow-up Questions
- How would noisy labels, class imbalance, or distribution shift affect the answer?
- What would you monitor after deployment?
- Which baseline would you compare against first?