Pandas Data Manipulation
Asked of: Data Scientist
Last updated

What's being tested
Pandas data manipulation here means turning messy local CSVs into trustworthy analytical tables: read, normalize, join, aggregate, pivot, and validate results. Interviewers are probing whether you can prevent silent analytical errors, especially duplicate joins, null mishandling, date parsing mistakes, and brittle preprocessing code.
Patterns & templates
-
Robust CSV ingestion with
`pd.read_csv()`: setdtype,parse_dates,encoding,sep,usecols; inspectshape,head(),isna().mean()before merging. -
Safe joins with
`df.merge(..., how=..., on=..., validate=...)`: usevalidate='one_to_one','many_to_one', or'one_to_many'to catch row explosions. -
Metric aggregation via
`groupby().agg()`: compute revenue, impressions, clicks, or conversions at the correct grain before calculating ratios likeCTR = clicks / impressions. -
Pivoted reporting with
`pivot_table(index=..., columns=..., values=..., aggfunc='sum', fill_value=0)`: confirm totals match the pre-pivot`groupby`output. -
Date handling with
`pd.to_datetime(errors='coerce')`,.dt.to_period(),`sort_values()`, and`ffill()`; always define calendar gaps versus true missing observations. -
Null and coalescing logic using
`fillna()`,`combine_first()`,`where()`, and`np.select()`; distinguish missing data from valid zeros in metrics. -
Code robustness: factor transformations into pure functions, add input validation, unit-test edge cases with
`pytest`, and document expected schema and complexity.
Common pitfalls
Pitfall: Merging raw fact tables before aggregation can create duplicate rows and inflate revenue, clicks, or impressions.
Pitfall: Computing
CTRas the mean of row-level rates instead ofsum(clicks) / sum(impressions)gives the wrong weighted metric.
Pitfall: Treating every missing date as needing
`ffill()`can invent activity; clarify whether absence means zero, unknown, or carry-forward state.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Featured in interview prep guides
Practice questions
- Merge CSVs and build revenue pivot with pandasCapital One · Data Scientist · Take-home Project · Medium
- Merge four CSVs locally, robustly and efficientlyCapital One · Data Scientist · Technical Screen · Medium
- Impute, join, and upsert using SQL and PythonCapital One · Data Scientist · Technical Screen · Medium
- Refactor code and enforce robustnessCapital One · Data Scientist · Onsite · medium
- Merge ad CSVs and compute CTRCapital One · Data Scientist · Take-home Project · Medium
- Compute Customer Spend and Engineer Features for 2023Capital One · Data Scientist · Onsite · Medium
- Evaluate OutlierHandler Class for Code Quality and TestingCapital One · Data Scientist · Onsite · medium
Related concepts
- Python/Pandas Data ManipulationData Manipulation (SQL/Python)
- Pandas Data WranglingData Manipulation (SQL/Python)
- SQL/Python Data Manipulation And JoinsData Manipulation (SQL/Python)
- Python, Pandas, NumPy, And R Data ManipulationData Manipulation (SQL/Python)
- Python Data Manipulation And Core CodingCoding & Algorithms
- SQL And Python Data ManipulationData Manipulation (SQL/Python)