Clean, split, merge, and aggregate with pandas

Q: Clean, split, merge, and aggregate with pandas

This question evaluates proficiency with pandas data manipulation and cleaning, including handling missing values, trimming and splitting strings, numeric type conversion, merges, time-aware ordering, and aggregation for per-entity summaries.

Q: How do I practice SQL interview questions?

PracHub provides an interactive SQL console where you can write and test queries against real database schemas. Get instant feedback and compare your solution with the expected output.

Q: What difficulty level is this coding question?

This is a medium difficulty Data Manipulation (SQL/Python) question, commonly asked during Technical Screen rounds at Uber.

Q: What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Uber during technical interviews.

Question

Given two CSVs, use pandas to clean, split strings, merge, and aggregate.

drivers.csv
 driver_id,name,signup_city
 D1,Jane Doe,SF
 D2,Mark S,NYC
 D3,Adam L,LA

trips.csv
 trip_id,driver_id,ts_utc,route,fare_usd
 T1,D1,2019-01-01T08:00:00Z,SF-CA|SFO,17.52
 T2,D1,2019-01-02T09:00:00Z,SF-CA|DAL,4.40
 T3,D2,2019-01-02T10:00:00Z,,
 T4,D3,2019-01-03T11:00:00Z,LA-CA|LAX,8.00

Tasks:
1) Load both files into DataFrames; show head(2) and tail(1) of trips to verify ingest.
2) Drop rows in trips with missing fare_usd or missing/empty route (after stripping whitespace). Ensure fare_usd is numeric.
3) Split the route column on '|' into origin and destination columns; trim whitespace. If the split yields fewer than 2 tokens, drop those rows.
4) Merge the cleaned trips with drivers on driver_id (left join from trips) and keep only rows with a matching driver.
5) Produce a per-driver summary with columns: driver_id, name, trips_count, avg_fare_usd (rounded to 2 decimals), last_3_trips_avg (average of the last 3 trips per driver ordered by ts_utc; if <3 trips, average over available).
6) Return the top 2 drivers by avg_fare_usd DESC (break ties by trips_count DESC, then name ASC) and print the final DataFrame schema (dtypes) to confirm transformations.
Explicitly use: head, tail, dropna, str.split, merge, groupby, sort_values, and rolling/agg as appropriate.

PracHub · Accepted Answer

This question evaluates proficiency with pandas data manipulation and cleaning, including handling missing values, trimming and splitting strings, numeric type conversion, merges, time-aware ordering, and aggregation for per-entity summaries.

Quick Overview

Quick Overview