Write SQL and Pandas for Uber Trips

Q: Write SQL and Pandas for Uber Trips

This question evaluates SQL and Pandas data manipulation competencies, including joins, time-based filtering, handling NULLs, aggregations, rate and ratio calculations, and windowed comparisons over grouped city-day data.

Q: How do I practice SQL interview questions?

PracHub provides an interactive SQL console where you can write and test queries against real database schemas. Get instant feedback and compare your solution with the expected output.

Q: What difficulty level is this coding question?

This is a medium difficulty Data Manipulation (SQL/Python) question, commonly asked during Technical Screen rounds at Uber.

Q: What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Uber during technical interviews.

Question

Assume 'today' = 2025-09-01. You are given the following schema and small ASCII samples.

Tables
- riders(rider_id, name, signup_date)
- drivers(driver_id, name, car_year)
- trips(trip_id, rider_id, driver_id, request_time, pickup_time, dropoff_time, status, city, surge_multiplier, distance_km, fare_usd)
 • status ∈ {'completed','cancelled_by_rider','cancelled_by_driver','driver_no_show'}
- payments(trip_id, amount_usd, method, card_hash, charge_time, success)
- devices(user_type, user_id, device_id) where user_type ∈ {'rider','driver'}

Samples (minimal, not exhaustive)
riders
rider_id | name | signup_date
1 | Alice | 2025-08-20
2 | Bob | 2025-08-28
3 | Chen | 2025-08-30
4 | Deepa | 2025-07-15

trips
trip_id | rider_id | driver_id | request_time | pickup_time | dropoff_time | status | city | surge_multiplier | distance_km | fare_usd
1 | 1 | 10 | 2025-08-25 19:55 | 2025-08-25 20:05 | 2025-08-25 20:25 | completed | SF | 1.8 | 8.0 | 24.5
2 | 2 | 11 | 2025-08-26 14:10 | 2025-08-26 14:15 | 2025-08-26 14:30 | completed | SF | 1.0 | 5.0 | 12.0
3 | 2 | 11 | 2025-08-26 20:05 | 2025-08-26 20:30 | NULL | cancelled_by_driver | SF | 2.0 | 0.0 | 0.0
4 | 3 | 12 | 2025-08-27 02:30 | NULL | NULL | cancelled_by_rider | NYC | 1.0 | 0.0 | 0.0
5 | 3 | 10 | 2025-08-30 20:15 | 2025-08-30 20:25 | 2025-08-30 20:45 | completed | NYC | 1.7 | 6.5 | 18.0
6 | 1 | 12 | 2025-08-31 14:05 | 2025-08-31 14:10 | 2025-08-31 14:30 | completed | SF | 1.0 | 7.0 | 16.0
7 | 4 | 10 | 2025-08-31 20:05 | 2025-08-31 20:20 | 2025-08-31 20:40 | completed | SF | 1.6 | 9.0 | 22.0
8 | 4 | 11 | 2025-09-01 02:10 | NULL | NULL | driver_no_show | SF | 1.0 | 0.0 | 0.0

payments
trip_id | amount_usd | method | card_hash | charge_time | success
1 | 24.5 | card | abc123 | 2025-08-25 20:26 | true
2 | 12.0 | card | xyz777 | 2025-08-26 14:31 | true
3 | 0.0 | card | xyz777 | 2025-08-26 20:31 | false
4 | 0.0 | card | qwe555 | 2025-08-27 02:32 | false
5 | 18.0 | card | abc123 | 2025-08-30 20:46 | true
6 | 16.0 | card | abc123 | 2025-08-31 14:31 | true
7 | 22.0 | cash | NULL | 2025-08-31 20:41 | false
8 | 0.0 | card | xyz777 | 2025-09-01 02:12 | false

devices
user_type | user_id | device_id
rider | 1 | devA
rider | 2 | devB
rider | 3 | devC
rider | 4 | devB
driver | 10 | devX
driver | 11 | devY
driver | 12 | devZ

Write SQL answers for A–D and a Pandas answer for E:
A) For the last 7 days (2025-08-26 to 2025-09-01 inclusive), compute per city: total requests, completed trips, and completion rate = completed / requests. Count a request if a row exists in trips (any status). Order by completion rate ascending. Handle NULL times robustly and ensure date filtering uses request_time in UTC.
B) For each driver, over the last 7 days, compute the ratio: avg surge during 20:00–21:59 divided by avg surge during 14:00–15:59 on their completed trips, within the same city-day buckets. Return drivers where both windows have at least 1 completed trip and the ratio > 1.5. Include driver_id, city, counts per window, both averages, and the ratio.
C) Over the last 7 days, compute the median pickup wait (pickup_time - request_time) per city for completed trips, after excluding trips above the city-specific 95th percentile wait. Use window functions to compute the percentile cutoff and the median on the truncated set.
D) Identify likely duplicate rider accounts in the last 30 days: output rider_id_a, rider_id_b (a<b), evidence_type ('device' or 'card'), evidence_value (device_id or card_hash), first_seen_time. A pair qualifies if the riders share either the same device_id in devices or the same payments.card_hash used on trips by different rider_ids. Exclude NULL evidence values. For 'card', join payments→trips to map card_hash to rider_id.
E) Pandas: Given a DataFrame trips_df of trips, compute 7-day new-user retention by cohort for riders whose first completed trip date is between 2025-08-25 and 2025-08-31. A rider is retained if they have ≥1 additional completed trip with dropoff_time within 7 days (inclusive) after their first completed trip. Return a DataFrame with cohort_date, n_new, n_retained, retention_rate, sorted by cohort_date.

PracHub · Accepted Answer

This question evaluates SQL and Pandas data manipulation competencies, including joins, time-based filtering, handling NULLs, aggregations, rate and ratio calculations, and windowed comparisons over grouped city-day data.

Quick Overview

Quick Overview