Python/Pandas Data Manipulation
Asked of: Data Scientist
Last updated

What's being tested
This tests analysis-grade data manipulation in pandas and SQL: cleaning messy inputs, joining heterogeneous tables, aggregating by time/customer/product segments, and ranking or deduplicating records correctly. Interviewers are probing whether you can produce trustworthy metric tables under realistic ambiguity: duplicate events, currency normalization, date granularity, missing values, and tie-breaking.
Patterns & templates
-
Groupby aggregation in
pandas:df.groupby(keys).agg(...)for revenue, counts, kWh, or sales totals; validate row grain before aggregating. -
Time bucketing with
pd.to_datetime,.dt.date,.dt.to_period("M"), or SQLDATE_TRUNC; avoid mixing timestamps and dates accidentally. -
Deduplication by business key using
drop_duplicates(subset=..., keep=...)orROW_NUMBER() OVER (...); declare deterministic tie-breakers. -
Join then normalize pattern: merge facts to lookup tables like exchange rates using
merge; check many-to-one assumptions before computing converted metrics. -
Ranking within groups via
rank,sort_values,cumcount, or SQLDENSE_RANK; specify whether ties should share rank. -
Conditional segmentation using
np.where,pd.cut,CASE WHEN, and boolean masks for Prime/non-Prime, price buckets, or customer cohorts. -
Streaming/counting basics for text-like inputs: use
collections.Counteror plaindict; Unicode normalization withunicodedata.normalizeand regex tokenization.
Common pitfalls
Pitfall: Aggregating before fixing grain. If order lines are duplicated or salaries repeat by country/date, totals and ranks become silently wrong.
Pitfall: Treating date joins as exact timestamp joins. Exchange rates, sales days, and meter readings often require explicit date extraction or as-of logic.
Pitfall: Returning code without explaining assumptions. Say how you handle nulls, duplicates, ties, currencies, and timezone/date boundaries.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Featured in interview prep guides
Practice questions
- Implement robust word counts and min/maxAmazon · Data Scientist · Onsite · Medium
- Append country tables and rank salaries in USDAmazon · Data Scientist · Technical Screen · Medium
- Design SQL/Pandas aggregations on retail schemaAmazon · Data Scientist · Technical Screen · Medium
- Calculate Weekly, Monthly Hours Watched by Premium UsersAmazon · Data Scientist · Technical Screen · Medium
- Analyze Monthly Prime vs Non-Prime Sales and Price BucketsAmazon · Data Scientist · Onsite · Medium
- Create Country-Level Spend Report Using PandasAmazon · Data Scientist · Technical Screen · Medium
- Compute 3-Day Rolling Revenue Averages with PandasAmazon · Data Scientist · Onsite · Medium
- Identify Date with Highest Total kWh Consumption Using PandasAmazon · Data Scientist · Onsite · Medium
- Explore Subscription Patterns and Status Transitions with SQL/PandasAmazon · Data Scientist · Technical Screen · Medium
- Identify Top Spenders and Segment Customers Using PythonAmazon · Data Scientist · Technical Screen · Medium
Related concepts
- Pandas Data ManipulationData Manipulation (SQL/Python)
- SQL And Python Data ManipulationData Manipulation (SQL/Python)
- Python, Pandas, NumPy, And R Data ManipulationData Manipulation (SQL/Python)
- Python Data Manipulation And Core CodingCoding & Algorithms
- SQL/Python Data Manipulation And JoinsData Manipulation (SQL/Python)
- Pandas Data WranglingData Manipulation (SQL/Python)