PracHub
QuestionsPremiumLearningGuidesCheatsheetNEWCareers

Quick Overview

This question evaluates a candidate's ability to review and critique Python preprocessing utilities, covering competencies in software design patterns (fit/transform separation), coding style and import hygiene, and unit-test specification for data-science pipelines.

  • Medium
  • Capital One
  • Coding & Algorithms
  • Data Scientist

Critique and Test Python Preprocessing Utilities Effectively

Company: Capital One

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Technical Screen

##### Scenario Code review of Python preprocessing utilities (OutlierHandler and three Imputer classes) ##### Question Summarize, at a high level, what the OutlierHandler class accomplishes. Why is keeping fit and transform as separate methods advantageous? Identify and critique any coding-style or import problems found in the three Imputer classes. Hand-write one critical unit test you would add for OutlierHandler. ##### Hints Think scikit-learn patterns, PEP-8, avoiding "from numpy import *", assert expected vs. actual.

Quick Answer: This question evaluates a candidate's ability to review and critique Python preprocessing utilities, covering competencies in software design patterns (fit/transform separation), coding style and import hygiene, and unit-test specification for data-science pipelines.

In many preprocessing pipelines, a utility first learns statistics from training data (`fit`) and later applies them to any dataset (`transform`). Implement a simplified OutlierHandler. Your function receives two lists: - `train`: the data used to learn outlier thresholds - `values`: the data to transform Rules: 1. Ignore `None` values in `train`. 2. Sort the remaining values and compute quartiles using this definition: - If the number of values is odd, exclude the overall median when splitting into lower and upper halves. - `Q1` is the median of the lower half. - `Q3` is the median of the upper half. - The median of an even-length list is the average of the two middle values. 3. Compute `IQR = Q3 - Q1`. 4. Compute bounds: - `lower = Q1 - 1.5 * IQR` - `upper = Q3 + 1.5 * IQR` 5. Transform each element in `values`: - keep `None` as `None` - if a number is below `lower`, replace it with `lower` - if a number is above `upper`, replace it with `upper` - otherwise leave it unchanged If `train` contains fewer than 2 non-`None` values, return `values` unchanged. Return the transformed list.

Constraints

  • 0 <= len(train), len(values) <= 100000
  • Each element is either an integer, a float, or None
  • Quartiles must be computed from `train` only, never from `values`

Examples

Input: ([10, 12, 13, 14, 15, 16, 18, 19, 100], [5, 15, 30])

Expected Output: [5, 15, 27.5]

Explanation: From the training data, Q1 = 12.5 and Q3 = 18.5, so IQR = 6. The bounds are 3.5 and 27.5. Only 30 is above the upper bound, so it is clipped to 27.5.

Input: ([None, 1, 2, 2, 3, 4, 100], [None, -5, 8, 4])

Expected Output: [None, -1.0, 7.0, 4]

Explanation: Ignoring None, the sorted training data is [1, 2, 2, 3, 4, 100]. Q1 = 2, Q3 = 4, so IQR = 2 and the bounds are -1 and 7. None stays None, -5 becomes -1, 8 becomes 7, and 4 stays unchanged.

Input: ([None, 7], [100, None, -3])

Expected Output: [100, None, -3]

Explanation: There is only one non-None training value, so the function must return `values` unchanged.

Input: ([2, 2, 2, 2], [2, 3, 1])

Expected Output: [2, 2.0, 2.0]

Explanation: Q1 = Q3 = 2, so IQR = 0 and both bounds are exactly 2. Any value not equal to 2 is clipped to 2.

Input: ([], [])

Expected Output: []

Explanation: Empty training data has fewer than 2 usable values, so the output is the input `values`, which is also empty.

Solution

def solution(train, values):
    def median(arr):
        n = len(arr)
        mid = n // 2
        if n % 2 == 1:
            return arr[mid]
        return (arr[mid - 1] + arr[mid]) / 2

    clean = sorted(x for x in train if x is not None)

    if len(clean) < 2:
        return list(values)

    n = len(clean)
    mid = n // 2

    if n % 2 == 1:
        lower_half = clean[:mid]
        upper_half = clean[mid + 1:]
    else:
        lower_half = clean[:mid]
        upper_half = clean[mid:]

    q1 = median(lower_half)
    q3 = median(upper_half)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr

    result = []
    for x in values:
        if x is None:
            result.append(None)
        elif x < lower:
            result.append(lower)
        elif x > upper:
            result.append(upper)
        else:
            result.append(x)

    return result

Time complexity: O(n log n + m), where n = len(train) and m = len(values). Space complexity: O(n).

Hints

  1. Write a small helper function to compute the median of a sorted list, then reuse it for Q1 and Q3.
  2. This is a fit/transform pattern: learn the clipping bounds from `train` once, then apply them to every element in `values`.
Last updated: Apr 28, 2026

Loading coding console...

PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • Careers
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Solve Four Coding Assessment Tasks - Capital One (medium)
  • Write SQL using joins and window functions - Capital One (medium)
  • Review Preprocessing Code and Tests - Capital One (easy)
  • Remove nodes with a given value - Capital One (medium)
  • Solve multiple algorithmic interview questions - Capital One (hard)