Critique and Test Python Preprocessing Utilities Effectively
Company: Capital One
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
Quick Answer: This question evaluates a candidate's ability to review and critique Python preprocessing utilities, covering competencies in software design patterns (fit/transform separation), coding style and import hygiene, and unit-test specification for data-science pipelines.
Constraints
- 0 <= len(train), len(values) <= 100000
- Each element is either an integer, a float, or None
- Quartiles must be computed from `train` only, never from `values`
Examples
Input: ([10, 12, 13, 14, 15, 16, 18, 19, 100], [5, 15, 30])
Expected Output: [5, 15, 27.5]
Explanation: From the training data, Q1 = 12.5 and Q3 = 18.5, so IQR = 6. The bounds are 3.5 and 27.5. Only 30 is above the upper bound, so it is clipped to 27.5.
Input: ([None, 1, 2, 2, 3, 4, 100], [None, -5, 8, 4])
Expected Output: [None, -1.0, 7.0, 4]
Explanation: Ignoring None, the sorted training data is [1, 2, 2, 3, 4, 100]. Q1 = 2, Q3 = 4, so IQR = 2 and the bounds are -1 and 7. None stays None, -5 becomes -1, 8 becomes 7, and 4 stays unchanged.
Input: ([None, 7], [100, None, -3])
Expected Output: [100, None, -3]
Explanation: There is only one non-None training value, so the function must return `values` unchanged.
Input: ([2, 2, 2, 2], [2, 3, 1])
Expected Output: [2, 2.0, 2.0]
Explanation: Q1 = Q3 = 2, so IQR = 0 and both bounds are exactly 2. Any value not equal to 2 is clipped to 2.
Input: ([], [])
Expected Output: []
Explanation: Empty training data has fewer than 2 usable values, so the output is the input `values`, which is also empty.
Solution
def solution(train, values):
def median(arr):
n = len(arr)
mid = n // 2
if n % 2 == 1:
return arr[mid]
return (arr[mid - 1] + arr[mid]) / 2
clean = sorted(x for x in train if x is not None)
if len(clean) < 2:
return list(values)
n = len(clean)
mid = n // 2
if n % 2 == 1:
lower_half = clean[:mid]
upper_half = clean[mid + 1:]
else:
lower_half = clean[:mid]
upper_half = clean[mid:]
q1 = median(lower_half)
q3 = median(upper_half)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
result = []
for x in values:
if x is None:
result.append(None)
elif x < lower:
result.append(lower)
elif x > upper:
result.append(upper)
else:
result.append(x)
return resultTime complexity: O(n log n + m), where n = len(train) and m = len(values). Space complexity: O(n).
Hints
- Write a small helper function to compute the median of a sorted list, then reuse it for Q1 and Q3.
- This is a fit/transform pattern: learn the clipping bounds from `train` once, then apply them to every element in `values`.