PracHub
Quick Overview

This question evaluates a candidate's competency in data-science engineering tasks including code review, data preprocessing techniques (outlier handling and imputation), reproducibility via virtual environments, and design and unit testing of data pipelines.

Review Preprocessing Code and Tests

Company: Capital One

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: easy

Interview Round: Onsite

You are reviewing a small Python preprocessing codebase during an interview. You do not need to write code.

**Part A: Environment and execution**

A shell script activates a Python virtual environment and runs a data-processing job. Explain what such a script is typically doing and why isolated virtual environments are useful in collaborative analytics work.

**Part B: Outlier processor**

You are shown an `OutlierProcessor` class with three methods:

- `input_check(df, columns)`: validates inputs,
- `fit(df)`: computes lower and upper percentile cutoffs for selected columns,
- `transform(df)`: truncates values outside those cutoffs.

Explain how this class should work, what failure cases you would look for, and what unit tests you would add.

**Part C: Imputer review**

You are shown a messy `Imputer` class that implements `mean`, `median`, and `mode` filling strategies. What design, readability, and reliability improvements would you recommend?

Part 1: Validate a Virtual Environment Script

A simplified shell script is represented as a list of commands, accompanied by a mapping from each job name to the list of packages that job requires. Activating a virtual environment creates or reuses an isolated package space; installing a package affects only the active environment; running a job succeeds only if all of that job's required packages are installed in the currently active environment. Validate the script and report the first failure together with its 1-based step number, or return how many jobs ran successfully.

Constraints

  • 0 <= len(commands) <= 10^5
  • Each command is a tuple of the form ('activate', env), ('install', package), ('run', job), or ('deactivate',)
  • Environment, package, and job names are strings
  • Installed packages persist inside an environment even after deactivation and later reactivation

Examples

Input: ([('activate', 'env1'), ('install', 'pandas'), ('install', 'numpy'), ('run', 'daily'), ('deactivate',), ('activate', 'env2'), ('install', 'numpy'), ('run', 'quick'), ('deactivate',), ('activate', 'env1'), ('run', 'daily')], {'daily': ['pandas', 'numpy'], 'quick': ['numpy']})

Expected Output: {'status': 'ok', 'runs': 3}

Explanation: env1 keeps its packages after being deactivated, so the final run succeeds too.

Input: ([('activate', 'env1'), ('install', 'pandas'), ('run', 'daily')], {'daily': ['pandas', 'numpy']})

Expected Output: {'status': 'error', 'step': 3, 'reason': 'MISSING_PACKAGE'}

Explanation: daily needs numpy, but it was never installed in env1.

Input: ([('install', 'pandas')], {'daily': ['pandas']})

Expected Output: {'status': 'error', 'step': 1, 'reason': 'NO_ACTIVE_ENV'}

Explanation: Packages can only be installed inside an active environment.

Input: ([('activate', 'a'), ('activate', 'b')], {})

Expected Output: {'status': 'error', 'step': 2, 'reason': 'ALREADY_ACTIVE'}

Explanation: A second environment cannot be activated before the first one is deactivated.

Input: ([], {})

Expected Output: {'status': 'error', 'step': 1, 'reason': 'NO_RUN'}

Explanation: Edge case: an empty script never runs a job.

Hints

  1. Simulate the script with a current active environment and a mapping from environment name to its installed packages.
  2. Fail immediately on the first invalid command; do not continue processing after an error.
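
The two hints can be combined into a single pass over the commands. The sketch below is one possible reading of the rules, not a reference solution. The function name `validate_script` is hypothetical, and several behaviours are assumptions because the statement does not cover them: `deactivate` or `run` with no active environment returns `NO_ACTIVE_ENV`, a job missing from the requirements mapping is treated as needing no packages, an unrecognised command returns a made-up `UNKNOWN_COMMAND` reason, and a non-empty script with no successful run reports `NO_RUN` at step `len(commands) + 1` (which gives step 1 for the empty script, as in the example).

```python
def validate_script(commands, job_requirements):
    """Simulate the script; return the first failure or the successful run count."""
    installed = {}   # environment name -> set of packages installed in it
    active = None    # name of the currently active environment, or None
    runs = 0

    def error(step, reason):
        return {'status': 'error', 'step': step, 'reason': reason}

    for step, command in enumerate(commands, start=1):
        action = command[0]
        if action == 'activate':
            if active is not None:
                return error(step, 'ALREADY_ACTIVE')
            active = command[1]
            installed.setdefault(active, set())  # create, or reuse with old packages
        elif action == 'deactivate':
            if active is None:
                return error(step, 'NO_ACTIVE_ENV')  # assumption, see lead-in
            active = None
        elif action == 'install':
            if active is None:
                return error(step, 'NO_ACTIVE_ENV')
            installed[active].add(command[1])
        elif action == 'run':
            if active is None:
                return error(step, 'NO_ACTIVE_ENV')  # assumption, see lead-in
            required = job_requirements.get(command[1], [])  # assumption: unknown job needs nothing
            if not set(required) <= installed[active]:
                return error(step, 'MISSING_PACKAGE')
            runs += 1
        else:
            return error(step, 'UNKNOWN_COMMAND')  # hypothetical reason code

    if runs == 0:
        return error(len(commands) + 1, 'NO_RUN')
    return {'status': 'ok', 'runs': runs}
```

Each command is handled in constant time apart from the subset check in `run`, so the simulation stays linear in the script length plus the total size of the requirement lists it inspects.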

Part 2: Percentile Outlier Capping

Implement the core of an outlier processor. Given fit_rows and transform_rows as table-like data, first validate the inputs, then compute lower and upper percentile cutoffs for selected numeric columns using linear interpolation. Finally, cap every selected value in transform_rows so it lies within that column's fitted range. Return 'INVALID_INPUT' if the data is malformed.

Constraints

  • 1 <= len(fit_rows) <= 10^4 on valid inputs
  • 0 <= len(transform_rows) <= 10^4
  • Each row is a dictionary
  • Selected columns must exist in every row and contain int or float values; bool is invalid
  • 0 <= lower_percentile <= upper_percentile <= 100
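
The rule that bool is invalid is easy to miss because `bool` is a subclass of `int` in Python, so `isinstance(True, int)` is true. A minimal validation sketch, using the hypothetical helper names `is_valid_number` and `rows_are_valid`, could look like this. It does not check for NaN or infinity, which the constraints leave unspecified.

```python
def is_valid_number(value):
    """Accept int or float, but reject bool even though bool subclasses int."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def rows_are_valid(rows, columns):
    """Check that every row is a dict holding a valid number in each selected column."""
    return all(
        isinstance(row, dict)
        and all(col in row and is_valid_number(row[col]) for col in columns)
        for row in rows
    )
```

An empty list of rows passes this check, which suits `transform_rows`; the requirement that `fit_rows` is non-empty would need a separate test.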

Examples

Input: ([{'a': 1, 'b': 10}, {'a': 2, 'b': 100}, {'a': 3, 'b': 20}, {'a': 100, 'b': 30}], [{'a': -5, 'b': 5}, {'a': 50, 'b': 150}], ['a', 'b'], 25, 75)

Expected Output: [{'a': 1.75, 'b': 17.5}, {'a': 27.25, 'b': 47.5}]

Explanation: The 25th and 75th percentile cutoffs are computed from fit_rows, then each transform value is capped into that range.
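
The cutoffs in this example can be reproduced with a small helper. The sketch below assumes the usual linear-interpolation convention in which percentile p sits at fractional index (p / 100) * (n - 1) of the sorted values, the same convention as the default `linear` method of `numpy.percentile`. The function name `percentile` is illustrative.

```python
import math


def percentile(values, p):
    """Percentile p (0 to 100) of a non-empty list, using linear interpolation."""
    ordered = sorted(values)
    position = (p / 100) * (len(ordered) - 1)  # fractional index into the sorted list
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    # Interpolate between the two neighbouring order statistics.
    return float(ordered[lower] + fraction * (ordered[upper] - ordered[lower]))
```

For column `a` the sorted fit values are [1, 2, 3, 100]. The 25th percentile sits at index 0.75, giving 1 + 0.75 * (2 - 1) = 1.75, and the 75th sits at index 2.25, giving 3 + 0.25 * (100 - 3) = 27.25.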

Input: ([{'x': 5}], [{'x': -100}, {'x': 20}, {'x': 5}], ['x'], 10, 90)

Expected Output: [{'x': 5.0}, {'x': 5.0}, {'x': 5.0}]

Explanation: Edge case: with one fit row, both cutoffs are the same value.

Input: ([{'x': 1}, {'x': 2}], [], ['x'], 0, 100)

Expected Output: []

Explanation: Edge case: transforming an empty dataset returns an empty list.

Input: ([{'x': 1}], [{'y': 2}], ['x'], 0, 100)

Expected Output: 'INVALID_INPUT'

Explanation: The selected column x is missing from transform_rows.

Input: ([], [{'x': 2}], ['x'], 0, 100)

Expected Output: 'INVALID_INPUT'

Explanation: Edge case: fitting percentiles requires at least one row.

Hints

  1. Compute cutoffs from fit_rows only; do not use transform_rows when fitting percentiles.
  2. After you know each column's [low, high] range, transforming is just clamping each selected value into that interval.

Part 3: Robust Missing-Value Imputer

Implement a reliable imputer for messy tabular data. For each selected column, compute a fill value using one strategy: mean, median, or mode. Then replace only None values in those columns. For mode ties, choose the smallest value. Return 'INVALID_INPUT' for unsupported strategies, missing columns, or columns that have no observed values.

Constraints

  • 0 <= len(rows) <= 10^4
  • rows is a list of dictionaries and columns is a non-empty list of strings
  • Missing values are represented by None
  • For mean and median, selected non-missing values must be numeric; bool is invalid
  • For mode, values in a selected column must all be of one comparable type

Examples

Input: ([{'x': 1, 'y': None}, {'x': 3, 'y': 4}, {'x': None, 'y': 6}], ['x', 'y'], 'mean')

Expected Output: [{'x': 1, 'y': 5.0}, {'x': 3, 'y': 4}, {'x': 2.0, 'y': 6}]

Explanation: x is filled with mean 2.0 and y is filled with mean 5.0.

Input: ([{'a': 1}, {'a': 9}, {'a': None}, {'a': 11}, {'a': 100}, {'a': None}], ['a'], 'median')

Expected Output: [{'a': 1}, {'a': 9}, {'a': 10.0}, {'a': 11}, {'a': 100}, {'a': 10.0}]

Explanation: The sorted non-missing values are [1, 9, 11, 100], so the median is 10.0.

Input: ([{'c': 'b'}, {'c': None}, {'c': 'a'}, {'c': 'b'}, {'c': 'a'}], ['c'], 'mode')

Expected Output: [{'c': 'b'}, {'c': 'a'}, {'c': 'a'}, {'c': 'b'}, {'c': 'a'}]

Explanation: Both 'a' and 'b' appear twice, so the tie is broken by choosing the smaller value 'a'.
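
The tie-break in this example can be expressed by counting first and then taking the minimum over the most frequent values. A minimal sketch, with the hypothetical name `mode_smallest` and assuming a non-empty list of hashable, mutually comparable values:

```python
from collections import Counter


def mode_smallest(values):
    """Most frequent value in a non-empty list; ties go to the smallest value."""
    counts = Counter(values)
    best = max(counts.values())
    # Several values may share the top count, so pick the smallest of them.
    return min(value for value, count in counts.items() if count == best)
```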

Input: ([], ['a'], 'mean')

Expected Output: []

Explanation: Edge case: an empty table stays empty.

Input: ([{'x': None}, {'x': None}], ['x'], 'mean')

Expected Output: 'INVALID_INPUT'

Explanation: Edge case: a selected column with no observed values cannot be imputed.

Hints

  1. Do this in two passes: first compute one fill value per selected column, then fill the missing cells.
  2. Median needs sorting; mode needs counting. Be careful with the all-missing-column edge case.
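
The two-pass approach from the hints might be put together as follows. This is a sketch under stated assumptions rather than a reference solution: the name `impute` and the helper names are hypothetical, an unsupported strategy is rejected even when `rows` is empty, and a mode column with mixed types is only rejected when counting or comparing its values actually raises `TypeError`.

```python
from collections import Counter
from statistics import median


def _is_number(value):
    # bool subclasses int, so it has to be excluded explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _fill_value(values, strategy):
    """Compute the fill value for one column from its observed (non-None) values."""
    if strategy == 'mode':
        counts = Counter(values)
        top = max(counts.values())
        return min(v for v, c in counts.items() if c == top)  # ties go to the smallest
    if not all(_is_number(v) for v in values):
        raise ValueError('mean and median need numeric values')
    if strategy == 'mean':
        return sum(values) / len(values)
    return median(values)


def impute(rows, columns, strategy):
    if strategy not in ('mean', 'median', 'mode'):
        return 'INVALID_INPUT'
    if not rows:
        return []
    # Pass 1: one fill value per selected column.
    fills = {}
    for column in columns:
        if any(column not in row for row in rows):
            return 'INVALID_INPUT'
        observed = [row[column] for row in rows if row[column] is not None]
        if not observed:
            return 'INVALID_INPUT'
        try:
            fills[column] = _fill_value(observed, strategy)
        except (TypeError, ValueError):
            return 'INVALID_INPUT'
    # Pass 2: replace only None cells in the selected columns, on copies of the rows.
    return [
        {k: (fills[k] if k in fills and v is None else v) for k, v in row.items()}
        for row in rows
    ]
```

Computing every fill value before touching any cell means a failure in a later column never leaves the data half-imputed, and the input rows are never mutated.
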
Last updated: Apr 19, 2026
