Clean, split, merge, and aggregate with pandas
Company: Uber
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Technical Screen
Given two CSVs, use pandas to clean, split strings, merge, and aggregate.
drivers.csv
driver_id,name,signup_city
D1,Jane Doe,SF
D2,Mark S,NYC
D3,Adam L,LA
trips.csv
trip_id,driver_id,ts_utc,route,fare_usd
T1,D1,2019-01-01T08:00:00Z,SF-CA|SFO,17.52
T2,D1,2019-01-02T09:00:00Z,SF-CA|DAL,4.40
T3,D2,2019-01-02T10:00:00Z,,
T4,D3,2019-01-03T11:00:00Z,LA-CA|LAX,8.00
Tasks:
1) Load both files into DataFrames; show head(2) and tail(1) of trips to verify ingest.
2) Drop rows in trips with missing fare_usd or missing/empty route (after stripping whitespace). Ensure fare_usd is numeric.
3) Split the route column on '|' into origin and destination columns; trim whitespace. If the split yields fewer than 2 tokens, drop those rows.
4) Merge the cleaned trips with drivers on driver_id (left join from trips) and keep only rows with a matching driver.
5) Produce a per-driver summary with columns: driver_id, name, trips_count, avg_fare_usd (rounded to 2 decimals), last_3_trips_avg (average of the last 3 trips per driver ordered by ts_utc; if <3 trips, average over available).
6) Return the top 2 drivers by avg_fare_usd DESC (break ties by trips_count DESC, then name ASC) and print the final DataFrame schema (dtypes) to confirm transformations.
Explicitly use: head, tail, dropna, str.split, merge, groupby, sort_values, and rolling/agg as appropriate.
Quick Answer: This question evaluates proficiency with pandas data manipulation and cleaning, including handling missing values, trimming and splitting strings, numeric type conversion, merges, time-aware ordering, and aggregation for per-entity summaries.