Critique and Test Python Preprocessing Utilities Effectively
Company: Capital One
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
Quick Answer: This question evaluates a candidate's ability to review and critique Python preprocessing utilities, covering competencies in software design patterns (fit/transform separation), coding style and import hygiene, and unit-test specification for data-science pipelines.
Constraints
- 0 <= len(train), len(values) <= 100000
- Each element is either an integer, a float, or None
- Quartiles must be computed from `train` only, never from `values`
Examples
Input: ([10, 12, 13, 14, 15, 16, 18, 19, 100], [5, 15, 30])
Expected Output: [5, 15, 27.5]
Explanation: From the training data, Q1 = 12.5 and Q3 = 18.5, so IQR = 6. The bounds are 3.5 and 27.5. Only 30 is above the upper bound, so it is clipped to 27.5.
Input: ([None, 1, 2, 2, 3, 4, 100], [None, -5, 8, 4])
Expected Output: [None, -1.0, 7.0, 4]
Explanation: Ignoring None, the sorted training data is [1, 2, 2, 3, 4, 100]. Q1 = 2, Q3 = 4, so IQR = 2 and the bounds are -1 and 7. None stays None, -5 becomes -1, 8 becomes 7, and 4 stays unchanged.
Input: ([None, 7], [100, None, -3])
Expected Output: [100, None, -3]
Explanation: There is only one non-None training value, so the function must return `values` unchanged.
Input: ([2, 2, 2, 2], [2, 3, 1])
Expected Output: [2, 2.0, 2.0]
Explanation: Q1 = Q3 = 2, so IQR = 0 and both bounds are exactly 2. Any value not equal to 2 is clipped to 2.
Input: ([], [])
Expected Output: []
Explanation: Empty training data has fewer than 2 usable values, so the output is the input `values`, which is also empty.
Hints
- Write a small helper function to compute the median of a sorted list, then reuse it for Q1 and Q3.
- This is a fit/transform pattern: learn the clipping bounds from `train` once, then apply them to every element in `values`.