Write SQL and Pandas for Uber Trips
Company: Uber
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Technical Screen
Assume 'today' = 2025-09-01. You are given the following schema and small ASCII samples.
Tables
- riders(rider_id, name, signup_date)
- drivers(driver_id, name, car_year)
- trips(trip_id, rider_id, driver_id, request_time, pickup_time, dropoff_time, status, city, surge_multiplier, distance_km, fare_usd)
• status ∈ {'completed','cancelled_by_rider','cancelled_by_driver','driver_no_show'}
- payments(trip_id, amount_usd, method, card_hash, charge_time, success)
- devices(user_type, user_id, device_id) where user_type ∈ {'rider','driver'}
Samples (minimal, not exhaustive)
riders
rider_id | name | signup_date
1 | Alice | 2025-08-20
2 | Bob | 2025-08-28
3 | Chen | 2025-08-30
4 | Deepa | 2025-07-15
drivers
driver_id | name | car_year
10 | Diego | 2018
11 | Elena | 2021
12 | Farid | 2015
trips
trip_id | rider_id | driver_id | request_time | pickup_time | dropoff_time | status | city | surge_multiplier | distance_km | fare_usd
1 | 1 | 10 | 2025-08-25 19:55 | 2025-08-25 20:05 | 2025-08-25 20:25 | completed | SF | 1.8 | 8.0 | 24.5
2 | 2 | 11 | 2025-08-26 14:10 | 2025-08-26 14:15 | 2025-08-26 14:30 | completed | SF | 1.0 | 5.0 | 12.0
3 | 2 | 11 | 2025-08-26 20:05 | 2025-08-26 20:30 | NULL | cancelled_by_driver | SF | 2.0 | 0.0 | 0.0
4 | 3 | 12 | 2025-08-27 02:30 | NULL | NULL | cancelled_by_rider | NYC | 1.0 | 0.0 | 0.0
5 | 3 | 10 | 2025-08-30 20:15 | 2025-08-30 20:25 | 2025-08-30 20:45 | completed | NYC | 1.7 | 6.5 | 18.0
6 | 1 | 12 | 2025-08-31 14:05 | 2025-08-31 14:10 | 2025-08-31 14:30 | completed | SF | 1.0 | 7.0 | 16.0
7 | 4 | 10 | 2025-08-31 20:05 | 2025-08-31 20:20 | 2025-08-31 20:40 | completed | SF | 1.6 | 9.0 | 22.0
8 | 4 | 11 | 2025-09-01 02:10 | NULL | NULL | driver_no_show | SF | 1.0 | 0.0 | 0.0
payments
trip_id | amount_usd | method | card_hash | charge_time | success
1 | 24.5 | card | abc123 | 2025-08-25 20:26 | true
2 | 12.0 | card | xyz777 | 2025-08-26 14:31 | true
3 | 0.0 | card | xyz777 | 2025-08-26 20:31 | false
4 | 0.0 | card | qwe555 | 2025-08-27 02:32 | false
5 | 18.0 | card | abc123 | 2025-08-30 20:46 | true
6 | 16.0 | card | abc123 | 2025-08-31 14:31 | true
7 | 22.0 | cash | NULL | 2025-08-31 20:41 | false
8 | 0.0 | card | xyz777 | 2025-09-01 02:12 | false
devices
user_type | user_id | device_id
rider | 1 | devA
rider | 2 | devB
rider | 3 | devC
rider | 4 | devB
driver | 10 | devX
driver | 11 | devY
driver | 12 | devZ
Write SQL answers for A–D and a Pandas answer for E:
A) For the last 7 days (2025-08-26 to 2025-09-01 inclusive), compute per city: total requests, completed trips, and completion rate = completed / requests. Count a request if a row exists in trips (any status). Order by completion rate ascending. Handle NULL times robustly and ensure date filtering uses request_time in UTC.
B) For each driver, over the last 7 days, compute the ratio: avg surge during 20:00–21:59 divided by avg surge during 14:00–15:59 on their completed trips, within the same city-day buckets. Return drivers where both windows have at least 1 completed trip and the ratio > 1.5. Include driver_id, city, counts per window, both averages, and the ratio.
C) Over the last 7 days, compute the median pickup wait (pickup_time - request_time) per city for completed trips, after excluding trips above the city-specific 95th percentile wait. Use window functions to compute the percentile cutoff and the median on the truncated set.
D) Identify likely duplicate rider accounts in the last 30 days: output rider_id_a, rider_id_b (a<b), evidence_type ('device' or 'card'), evidence_value (device_id or card_hash), first_seen_time. A pair qualifies if the riders share either the same device_id in devices or the same payments.card_hash used on trips by different rider_ids. Exclude NULL evidence values. For 'card', join payments→trips to map card_hash to rider_id.
E) Pandas: Given a DataFrame trips_df of trips, compute 7-day new-user retention by cohort for riders whose first completed trip date is between 2025-08-25 and 2025-08-31. A rider is retained if they have ≥1 additional completed trip with dropoff_time within 7 days (inclusive) after their first completed trip. Return a DataFrame with cohort_date, n_new, n_retained, retention_rate, sorted by cohort_date.
Quick Answer: This question evaluates SQL and Pandas data manipulation competencies, including joins, time-based filtering, handling NULLs, aggregations, rate and ratio calculations, and windowed comparisons over grouped city-day data.