Python, Pandas, NumPy, And R Data Manipulation
Asked of: Data Scientist

What's being tested
This tests vectorized tabular manipulation in pandas, NumPy, and dplyr: create derived columns, join lookup tables, compute group aggregates, and run small simulations without row-by-row loops. Interviewers are probing whether you can write correct, scalable analysis code while handling missing values, type coercion, random sampling, and edge cases.
Patterns & templates
- Vectorized conditionals: use `np.select`, `np.where`, `case_when`, or boolean masks; encode precedence explicitly, from the most specific condition to the least specific.
- Group-wise transforms: use `df.groupby(keys)[col].transform('mean')` to broadcast aggregates back to rows; in `dplyr`, use `group_by()` plus `mutate()`.
- Join then mutate: use `left_join()` / `merge(..., how='left')` to attach treatment parameters, then compute adjusted values; validate row counts after joins.
- Random simulation: use `sample_n`, `slice_sample`, `np.random.binomial`, or `np.random.default_rng` (`slice_sample` supersedes `sample_n` in dplyr 1.0.0 and later); set seeds for reproducibility and avoid explicit loops when vectorization works.
- Column normalization: compute column sums with `axis=0`, divide via broadcasting, and define behavior for zero-sum columns before coding.
- Missing-value semantics: `==`, `<`, and `>` comparisons involving `NaN` evaluate to False in `pandas` (only `!=` evaluates to True); use `isna()`, `notna()`, `fillna()`, and nullable dtypes deliberately.
- Complexity expectations: most solutions should be `O(n)` or `O(n + k)` time with linear memory; avoid `apply(axis=1)` unless data is tiny or logic is non-vectorizable.
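As a minimal sketch of the first two patterns, the snippet below builds a conditional column with `np.select` and then broadcasts a per-team mean back to each row with `transform`. The DataFrame, its column names (`user`, `team`, `messages`), and the activity thresholds are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row per user with a team label and a message count.
df = pd.DataFrame({
    "user": ["a", "b", "c", "d", "e"],
    "team": ["x", "x", "y", "y", "y"],
    "messages": [10, 30, 0, 6, 12],
})

# np.select picks the choice of the FIRST matching condition, so the most
# specific condition (exactly zero) is listed before the broader one (< 10).
conditions = [
    df["messages"] == 0,
    df["messages"] < 10,
]
choices = ["inactive", "low"]
df["activity"] = np.select(conditions, choices, default="high")

# transform('mean') returns one value per original row, aligned to df's index,
# so the group aggregate can be subtracted without a merge or a loop.
df["team_mean"] = df.groupby("team")["messages"].transform("mean")
df["deviation"] = df["messages"] - df["team_mean"]
```

Swapping the two conditions would label zero-message users as "low", which is why the ordering is part of the logic rather than a style choice.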
Common pitfalls
Pitfall: Treating `NaN == NaN` as true or using normal comparisons on missing numeric fields; use `isna()` / `notna()` instead.
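The behavior can be checked on a small, made-up Series. This sketch compares a float column containing `NaN` against itself and against zero, then shows the explicit missing-value helpers and, as a contrast, a nullable `Int64` column.

```python
import numpy as np
import pandas as pd

# Illustrative float Series with one missing value.
s = pd.Series([1.0, np.nan, 3.0])

eq_self = (s == s).tolist()   # [True, False, True]: NaN is not equal to itself
ne_self = (s != s).tolist()   # [False, True, False]: != is the one True case
gt_zero = (s > 0).tolist()    # [True, False, True]: NaN silently fails the filter

# Explicit handling states the intent instead of relying on comparison quirks.
missing = s.isna().tolist()        # [False, True, False]
filled = s.fillna(0.0).tolist()    # [1.0, 0.0, 3.0]

# With a nullable dtype, a comparison yields <NA> rather than False,
# so the missing row must be resolved explicitly before use as a mask.
n = pd.Series([1, None, 3], dtype="Int64")
na_mask = n > 0
resolved = na_mask.fillna(False).tolist()   # [True, False, True]
```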
Pitfall: Creating many-to-many joins accidentally and inflating rows; check key uniqueness and compare pre/post row counts.
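One way to guard against this in pandas is sketched below: check key uniqueness on the lookup table, pass `validate="many_to_one"` to `merge`, and compare row counts afterwards. The tables (`orders`, `params`), their columns, and the choice to treat an unmatched group as "no discount" are assumptions made for the example.

```python
import pandas as pd

# Hypothetical fact table and lookup table of treatment parameters.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "group": ["a", "b", "c"],
    "price": [100.0, 200.0, 50.0],
})
params = pd.DataFrame({
    "group": ["a", "b"],
    "discount": [0.10, 0.25],
})

# The lookup side should have exactly one row per key.
assert params["group"].is_unique

# validate="many_to_one" makes pandas raise if the right-hand keys repeat.
merged = orders.merge(params, on="group", how="left", validate="many_to_one")
assert len(merged) == len(orders)  # a left join on unique keys keeps row count

# Group "c" has no parameters, so its discount is NaN after the join.
# Assumed policy for this sketch: an unmatched group gets no discount.
merged["adjusted_price"] = merged["price"] * (1 - merged["discount"].fillna(0.0))

# A lookup table with a duplicated key is rejected instead of inflating rows.
dup_params = pd.concat([params, params.iloc[[0]]], ignore_index=True)
try:
    orders.merge(dup_params, on="group", how="left", validate="many_to_one")
    duplicate_caught = False
except pd.errors.MergeError:
    duplicate_caught = True
```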
Pitfall: Normalizing by a zero column sum and returning `inf` or `NaN` unintentionally; specify whether to keep zeros, return `NaN`, or skip the column.
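A small sketch of a seeded binomial simulation followed by column normalization, with the zero-sum case decided up front. The matrix shape, the binomial parameters, the seed, and the policy of returning `NaN` for zero-sum columns are all assumptions for illustration; keeping zeros or dropping the column are equally valid choices if stated.

```python
import numpy as np

# Seeded generator so the simulation is reproducible.
rng = np.random.default_rng(42)

# 5 x 4 matrix of Binomial(n=10, p=0.3) draws, generated in one vectorized call.
X = rng.binomial(n=10, p=0.3, size=(5, 4)).astype(float)
X[:, 2] = 0.0  # force one zero-sum column to exercise the edge case

# Column sums have shape (4,), one entry per column.
col_sums = X.sum(axis=0)

# Assumed policy: a zero-sum column becomes NaN on purpose, not by accident.
# Replacing the zero divisor with NaN also avoids a divide-by-zero warning.
safe_sums = np.where(col_sums == 0, np.nan, col_sums)

# Broadcasting divides each (5, 4) row by the (4,) vector of column sums.
normalized = X / safe_sums
```

Each column with a positive sum now sums to 1, and the forced zero column is entirely `NaN`.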
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions
- Generate binomial matrix and column-normalize (Google · Data Scientist · Technical Screen · Medium)
- Write SQL/Python for messy event data (Google · Data Scientist · Technical Screen · Medium)
- Add a conditional column in Python (Google · Data Scientist · Technical Screen · Medium)
- Implement R dplyr simulation and left join (Google · Data Scientist · Technical Screen · Medium)
- Calculate User Deviation from Team Average Messages (Google · Data Scientist · Technical Screen · Medium)
- Sample and Simulate Price Adjustments in R with dplyr (Google · Data Scientist · Technical Screen · Medium)
Related concepts
- Python/Pandas Data Manipulation · Data Manipulation (SQL/Python)
- Python Data Manipulation And Core Coding · Coding & Algorithms
- Pandas Data Manipulation · Data Manipulation (SQL/Python)
- Pandas Data Wrangling · Data Manipulation (SQL/Python)
- SQL And Python Data Manipulation · Data Manipulation (SQL/Python)
- SQL/Python Data Manipulation And Joins · Data Manipulation (SQL/Python)