Review Preprocessing Code and Tests
Company: Capital One
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: Easy
Interview Round: Onsite
Quick Answer: This question tests data-science engineering skills: code review, data preprocessing (outlier handling and imputation), reproducibility through virtual environments, and the design and unit testing of data pipelines.
Part 1: Validate a Virtual Environment Script
Constraints
- 0 <= len(commands) <= 10^5
- Each command is a tuple of the form ('activate', env), ('install', package), ('run', job), or ('deactivate',)
- Environment, package, and job names are strings
- Installed packages persist inside an environment even after deactivation and later reactivation
Examples
Input: ([('activate', 'env1'), ('install', 'pandas'), ('install', 'numpy'), ('run', 'daily'), ('deactivate',), ('activate', 'env2'), ('install', 'numpy'), ('run', 'quick'), ('deactivate',), ('activate', 'env1'), ('run', 'daily')], {'daily': ['pandas', 'numpy'], 'quick': ['numpy']})
Expected Output: {'status': 'ok', 'runs': 3}
Explanation: env1 keeps its packages after being deactivated, so the final run succeeds too.
Input: ([('activate', 'env1'), ('install', 'pandas'), ('run', 'daily')], {'daily': ['pandas', 'numpy']})
Expected Output: {'status': 'error', 'step': 3, 'reason': 'MISSING_PACKAGE'}
Explanation: daily needs numpy, but it was never installed in env1.
Input: ([('install', 'pandas')], {'daily': ['pandas']})
Expected Output: {'status': 'error', 'step': 1, 'reason': 'NO_ACTIVE_ENV'}
Explanation: Packages can only be installed inside an active environment.
Input: ([('activate', 'a'), ('activate', 'b')], {})
Expected Output: {'status': 'error', 'step': 2, 'reason': 'ALREADY_ACTIVE'}
Explanation: A second environment cannot be activated before the first one is deactivated.
Input: ([], {})
Expected Output: {'status': 'error', 'step': 1, 'reason': 'NO_RUN'}
Explanation: Edge case: an empty script never runs a job.
Hints
- Simulate the script with a current active environment and a mapping from environment name to its installed packages.
- Fail immediately on the first invalid command; do not continue processing after an error.
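The two hints above can be sketched as a single pass over the commands. This is a minimal sketch, not a reference solution: the function name `validate_script` is hypothetical, and several behaviors are assumptions the examples do not settle. Specifically, it assumes that `run` and `deactivate` without an active environment also report NO_ACTIVE_ENV, that a job missing from the requirements mapping reports a made-up reason `UNKNOWN_JOB`, that an unrecognized command reports a made-up reason `UNKNOWN_COMMAND`, and that NO_RUN is reported at step len(commands) + 1 (which matches the empty-script example).

```python
def validate_script(commands, job_requirements):
    """Simulate the script and stop at the first invalid command."""
    installed = {}   # environment name -> set of installed packages
    active = None    # name of the currently active environment, if any
    runs = 0

    def error(step, reason):
        return {'status': 'error', 'step': step, 'reason': reason}

    for step, command in enumerate(commands, start=1):  # steps are 1-indexed
        action = command[0]
        if action == 'activate':
            if active is not None:
                return error(step, 'ALREADY_ACTIVE')
            active = command[1]
            # setdefault keeps packages from an earlier activation
            installed.setdefault(active, set())
        elif active is None:
            # Assumption: every other command needs an active environment
            return error(step, 'NO_ACTIVE_ENV')
        elif action == 'install':
            installed[active].add(command[1])
        elif action == 'run':
            required = job_requirements.get(command[1])
            if required is None:
                # Assumption: hypothetical reason code, not from the prompt
                return error(step, 'UNKNOWN_JOB')
            if not set(required) <= installed[active]:
                return error(step, 'MISSING_PACKAGE')
            runs += 1
        elif action == 'deactivate':
            active = None
        else:
            # Assumption: hypothetical reason code, not from the prompt
            return error(step, 'UNKNOWN_COMMAND')

    if runs == 0:
        # Assumption: report the step just past the end of the script
        return error(len(commands) + 1, 'NO_RUN')
    return {'status': 'ok', 'runs': runs}
```

Storing packages in a dictionary keyed by environment name, rather than on the active environment alone, is what lets env1 keep pandas and numpy across deactivation. Each command is handled in constant time apart from the subset check on `run`.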
Part 2: Percentile Outlier Capping
Constraints
- 1 <= len(fit_rows) <= 10^4 on valid inputs; an empty fit_rows is invalid
- 0 <= len(transform_rows) <= 10^4
- Each row is a dictionary
- Selected columns must exist in every row and contain int or float values; bool is invalid
- 0 <= lower_percentile <= upper_percentile <= 100
Examples
Input: ([{'a': 1, 'b': 10}, {'a': 2, 'b': 100}, {'a': 3, 'b': 20}, {'a': 100, 'b': 30}], [{'a': -5, 'b': 5}, {'a': 50, 'b': 150}], ['a', 'b'], 25, 75)
Expected Output: [{'a': 1.75, 'b': 17.5}, {'a': 27.25, 'b': 47.5}]
Explanation: The 25th and 75th percentile cutoffs are computed from fit_rows with linear interpolation, giving [1.75, 27.25] for a and [17.5, 47.5] for b. Each transform value is then capped into its column's range.
Input: ([{'x': 5}], [{'x': -100}, {'x': 20}, {'x': 5}], ['x'], 10, 90)
Expected Output: [{'x': 5.0}, {'x': 5.0}, {'x': 5.0}]
Explanation: Edge case: with one fit row, both cutoffs are the same value.
Input: ([{'x': 1}, {'x': 2}], [], ['x'], 0, 100)
Expected Output: []
Explanation: Edge case: transforming an empty dataset returns an empty list.
Input: ([{'x': 1}], [{'y': 2}], ['x'], 0, 100)
Expected Output: 'INVALID_INPUT'
Explanation: The selected column x is missing from transform_rows.
Input: ([], [{'x': 2}], ['x'], 0, 100)
Expected Output: 'INVALID_INPUT'
Explanation: Edge case: fitting percentiles requires at least one row.
Hints
- Compute cutoffs from fit_rows only; do not use transform_rows when fitting percentiles.
- After you know each column's [low, high] range, transforming is just clamping each selected value into that interval.
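The fit-then-clamp approach from the hints might look like the sketch below. The names `cap_outliers`, `_is_number`, and `_percentile` are hypothetical. The percentile rule is an assumption inferred from the expected outputs: linear interpolation between the two nearest sorted values, at position (n - 1) * p / 100. The sketch also assumes that capped values are returned as floats and that columns outside the selected list are copied through unchanged.

```python
def _is_number(value):
    # bool is a subclass of int in Python, so it must be excluded explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _percentile(sorted_values, pct):
    """Linearly interpolated percentile of an ascending, non-empty list."""
    position = (len(sorted_values) - 1) * pct / 100
    lower_index = int(position)
    upper_index = min(lower_index + 1, len(sorted_values) - 1)
    fraction = position - lower_index
    lower_value = sorted_values[lower_index]
    return lower_value + fraction * (sorted_values[upper_index] - lower_value)


def cap_outliers(fit_rows, transform_rows, columns,
                 lower_percentile, upper_percentile):
    if not fit_rows or not 0 <= lower_percentile <= upper_percentile <= 100:
        return 'INVALID_INPUT'
    # Every selected column must be present and numeric in every row
    for row in list(fit_rows) + list(transform_rows):
        if not isinstance(row, dict):
            return 'INVALID_INPUT'
        if any(col not in row or not _is_number(row[col]) for col in columns):
            return 'INVALID_INPUT'

    # Fit: cutoffs come from fit_rows only
    bounds = {}
    for col in columns:
        values = sorted(row[col] for row in fit_rows)
        bounds[col] = (_percentile(values, lower_percentile),
                       _percentile(values, upper_percentile))

    # Transform: clamp each selected value into its column's [low, high]
    capped = []
    for row in transform_rows:
        new_row = dict(row)  # copy so the caller's rows are not mutated
        for col, (low, high) in bounds.items():
            new_row[col] = float(min(max(row[col], low), high))
        capped.append(new_row)
    return capped
```

Keeping the fitted `bounds` separate from the transform loop mirrors the usual fit/transform split: the cutoffs never see transform_rows, which avoids leaking information from the data being transformed.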
Part 3: Robust Missing-Value Imputer
Constraints
- 0 <= len(rows) <= 10^4
- rows is a list of dictionaries and columns is a non-empty list of strings
- Missing values are represented by None
- For mean and median, selected non-missing values must be numeric; bool is invalid
- For mode, the non-missing values in a selected column must all be of one comparable type
Examples
Input: ([{'x': 1, 'y': None}, {'x': 3, 'y': 4}, {'x': None, 'y': 6}], ['x', 'y'], 'mean')
Expected Output: [{'x': 1, 'y': 5.0}, {'x': 3, 'y': 4}, {'x': 2.0, 'y': 6}]
Explanation: x is filled with mean 2.0 and y is filled with mean 5.0.
Input: ([{'a': 1}, {'a': 9}, {'a': None}, {'a': 11}, {'a': 100}, {'a': None}], ['a'], 'median')
Expected Output: [{'a': 1}, {'a': 9}, {'a': 10.0}, {'a': 11}, {'a': 100}, {'a': 10.0}]
Explanation: The sorted non-missing values are [1, 9, 11, 100], so the median is 10.0.
Input: ([{'c': 'b'}, {'c': None}, {'c': 'a'}, {'c': 'b'}, {'c': 'a'}], ['c'], 'mode')
Expected Output: [{'c': 'b'}, {'c': 'a'}, {'c': 'a'}, {'c': 'b'}, {'c': 'a'}]
Explanation: Both 'a' and 'b' appear twice, so the tie is broken by choosing the smaller value 'a'.
Input: ([], ['a'], 'mean')
Expected Output: []
Explanation: Edge case: an empty table stays empty.
Input: ([{'x': None}, {'x': None}], ['x'], 'mean')
Expected Output: 'INVALID_INPUT'
Explanation: Edge case: a selected column with no observed values cannot be imputed.
Hints
- Do this in two passes: first compute one fill value per selected column, then fill the missing cells.
- Median needs sorting; mode needs counting. Be careful with the all-missing-column edge case.
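A two-pass version of the imputer, following the hints above, could be sketched as follows. The names `impute` and `_fill_value` are hypothetical. Several choices are assumptions the prompt leaves open: an unknown strategy, an empty column list, or a row that lacks a selected key all return 'INVALID_INPUT'; mean and median fill values are floats; and existing non-missing cells are left exactly as they were.

```python
from collections import Counter


def _fill_value(observed, strategy):
    """Return the fill value for one column, or None if the data is invalid."""
    if strategy == 'mode':
        if len({type(v) for v in observed}) != 1:
            return None  # mixed types cannot be compared reliably
        try:
            counts = Counter(observed)
            top = max(counts.values())
            # Ties are broken by choosing the smallest value
            return min(v for v, c in counts.items() if c == top)
        except TypeError:
            return None  # unhashable or unorderable values
    # mean and median need numeric values; bool is rejected explicitly
    if not all(isinstance(v, (int, float)) and not isinstance(v, bool)
               for v in observed):
        return None
    if strategy == 'mean':
        return sum(observed) / len(observed)
    ordered = sorted(observed)
    middle = len(ordered) // 2
    if len(ordered) % 2:
        return float(ordered[middle])
    return (ordered[middle - 1] + ordered[middle]) / 2


def impute(rows, columns, strategy):
    if strategy not in ('mean', 'median', 'mode') or not columns:
        return 'INVALID_INPUT'
    if not rows:
        return []  # an empty table stays empty
    for row in rows:
        # Assumption: a row that lacks a selected key is invalid
        if not isinstance(row, dict) or any(c not in row for c in columns):
            return 'INVALID_INPUT'

    # Pass 1: one fill value per selected column
    fills = {}
    for col in columns:
        observed = [row[col] for row in rows if row[col] is not None]
        fill = _fill_value(observed, strategy) if observed else None
        if fill is None:
            return 'INVALID_INPUT'  # all-missing or invalid column
        fills[col] = fill

    # Pass 2: fill only the missing cells, leaving other values untouched
    return [{**row, **{c: fills[c] for c in columns if row[c] is None}}
            for row in rows]
```

Computing every fill value before touching any row means a late validation failure never leaves a partially imputed table, and the input rows are never mutated.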