Code Review: OutlierHandler and Imputer Classes
You are given a Python module that implements one OutlierHandler class and three Imputer classes for preprocessing tabular data. The classes appear intended for machine-learning pipelines, but the coding style is inconsistent and the test coverage is uneven.
Assume OutlierHandler detects outliers per feature using rules such as IQR capping or z-score thresholds, and the imputer classes learn statistics during fit and fill missing values during transform.
Constraints & Assumptions
- Treat the classes as stateful preprocessing components for train/validation/test pipelines.
- Focus on code quality, API design, correctness, leakage prevention, and testing.
- Do not assume access to production internals beyond the class behavior described.
- Discuss both behavior and maintainability.
Clarifying Questions to Ask
- Should the classes follow scikit-learn's estimator API exactly?
- Are inputs NumPy arrays, pandas DataFrames, or both?
- Should transforms preserve column names, dtypes, indexes, and missing-value markers?
- Are outliers capped, removed, replaced, or flagged?
Part 1 - OutlierHandler Summary
Provide a high-level summary of what the OutlierHandler class does.
What This Part Should Cover
- Explain that it learns per-feature thresholds during `fit`.
- Explain that `transform` applies stored thresholds to new data consistently.
- Mention strategies such as IQR, z-score, capping, masking, or replacement.
- Connect the class to ML preprocessing pipelines.
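To make the summary concrete, here is a minimal sketch of what such a class might look like, assuming NumPy array input and IQR capping only. The names `factor`, `lower_`, and `upper_` are illustrative assumptions and are not taken from the module under review.

```python
import numpy as np


class OutlierHandler:
    """Hypothetical sketch of per-feature IQR capping (not the reviewed code)."""

    def __init__(self, factor=1.5):
        # factor is an assumed parameter: how many IQRs beyond the quartiles to allow.
        self.factor = factor

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        # Quartiles are computed column by column, ignoring NaNs.
        q1, q3 = np.nanpercentile(X, [25, 75], axis=0)
        iqr = q3 - q1
        self.lower_ = q1 - self.factor * iqr
        self.upper_ = q3 + self.factor * iqr
        return self

    def transform(self, X):
        if not hasattr(self, "lower_"):
            raise RuntimeError("Call fit before transform.")
        X = np.asarray(X, dtype=float)
        # np.clip returns a new array, so the caller's data is not modified.
        return np.clip(X, self.lower_, self.upper_)


train = [[1.0], [2.0], [3.0], [4.0], [5.0]]
handler = OutlierHandler().fit(train)  # learned bounds for the column: -1.0 and 7.0
capped = handler.transform([[100.0], [-50.0], [3.0]])
# capped is [[7.0], [-1.0], [3.0]]: new data is capped with the stored training bounds.
```

The thresholds live on the fitted object, so the same bounds are applied to every later batch regardless of what that batch contains.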
Part 2 - Fit and Transform Separation
Explain why separating fit and transform into two methods matters.
What This Part Should Cover
- Prevent data leakage by learning statistics only on training data.
- Ensure validation, test, and production data are transformed consistently.
- Support pipelines, cross-validation, serialization, and reproducibility.
- Clarify behavior when `transform` is called before `fit`.
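The points above can be illustrated with a small sketch. `MeanImputer` here is a hypothetical stand-in for one of the three imputer classes, assuming NumPy input; the attribute name `means_` and the choice of `RuntimeError` are assumptions, not details of the reviewed module.

```python
import numpy as np


class MeanImputer:
    """Hypothetical stand-in for one of the imputer classes."""

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        # Statistics are learned once, from the data passed to fit.
        self.means_ = np.nanmean(X, axis=0)
        return self

    def transform(self, X):
        if not hasattr(self, "means_"):
            # An explicit error is clearer than silently fitting on the fly.
            raise RuntimeError("MeanImputer.transform called before fit.")
        X = np.asarray(X, dtype=float)
        # np.where builds a new array; the input is left untouched.
        return np.where(np.isnan(X), self.means_, X)


train = np.array([[1.0], [3.0], [np.nan]])
test = np.array([[np.nan], [100.0]])

imputer = MeanImputer().fit(train)
filled_test = imputer.transform(test)  # the gap is filled with the training mean, 2.0

# A leaky alternative would compute the fill value from the test set itself:
leaky_fill = np.nanmean(test, axis=0)  # 100.0, a statistic the model should never see
```

Keeping the two methods separate means the fill value is fixed at 2.0 no matter which batch is transformed, which is what makes validation, test, and production results comparable.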
Part 3 - Code Quality and Testing
Evaluate the code quality and propose tests.
What This Part Should Cover
- Review API consistency, input validation, error handling, documentation, naming, type handling, and edge cases.
- Test missing values, constant columns, all-null columns, mixed dtypes, unseen categories, extreme values, small samples, and transform-before-fit errors.
- Test shape preservation, no mutation of inputs, deterministic output, and parity across train/test.
- Include unit tests and integration tests inside a simple ML pipeline.
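A few of the unit tests listed above might be sketched as follows. `MedianImputer` is a hypothetical class under test, and its behavior (raising `ValueError` on an all-null column, `RuntimeError` before fit) is an assumption chosen for illustration; the real classes may define different contracts. The functions use plain assertions, so they can be collected by pytest or called directly.

```python
import numpy as np


class MedianImputer:
    """Hypothetical class under test; stands in for the reviewed code."""

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        if np.isnan(X).all(axis=0).any():
            raise ValueError("Cannot fit a column that is entirely missing.")
        self.medians_ = np.nanmedian(X, axis=0)
        return self

    def transform(self, X):
        if not hasattr(self, "medians_"):
            raise RuntimeError("Call fit before transform.")
        X = np.asarray(X, dtype=float)
        return np.where(np.isnan(X), self.medians_, X)


def test_transform_before_fit_raises():
    try:
        MedianImputer().transform([[1.0]])
    except RuntimeError:
        return
    raise AssertionError("expected RuntimeError")


def test_all_null_column_rejected():
    try:
        MedianImputer().fit([[np.nan, 1.0], [np.nan, 2.0]])
    except ValueError:
        return
    raise AssertionError("expected ValueError")


def test_shape_preserved_and_input_not_mutated():
    X = np.array([[1.0, np.nan], [3.0, 4.0]])
    before = X.copy()
    out = MedianImputer().fit(X).transform(X)
    assert out.shape == X.shape
    assert np.array_equal(X, before, equal_nan=True)
    assert not np.isnan(out).any()


def test_new_data_uses_training_statistics():
    imputer = MedianImputer().fit([[1.0], [2.0], [3.0]])
    out = imputer.transform([[np.nan], [50.0]])
    # The gap is filled with the training median, not a statistic of the new batch.
    assert out.tolist() == [[2.0], [50.0]]
```

The same pattern extends to constant columns, mixed dtypes, and small samples; an integration test would additionally fit the transformer and a model on a training split and score on a held-out split.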
Follow-up Questions
- How would you make the classes compatible with scikit-learn pipelines?
- What bug would you expect if thresholds are recomputed during transform?
- How would you test the behavior on a DataFrame with nonnumeric columns?