Assume 'today' = 2025-09-01. You are given the following schema and small ASCII samples.
Tables
riders(rider_id, name, signup_date)
drivers(driver_id, name, car_year)
trips(trip_id, rider_id, driver_id, request_time, pickup_time, dropoff_time, status, city, surge_multiplier, distance_km, fare_usd)
payments(trip_id, amount_usd, method, card_hash, charge_time, success)
devices(user_type, user_id, device_id)
Samples (minimal, not exhaustive)

riders
rider_id | name  | signup_date
1        | Alice | 2025-08-20
2        | Bob   | 2025-08-28
3        | Chen  | 2025-08-30
4        | Deepa | 2025-07-15

drivers
driver_id | name  | car_year
10        | Diego | 2018
11        | Elena | 2021
12        | Farid | 2015

trips
trip_id | rider_id | driver_id | request_time     | pickup_time      | dropoff_time     | status              | city | surge_multiplier | distance_km | fare_usd
1       | 1        | 10        | 2025-08-25 19:55 | 2025-08-25 20:05 | 2025-08-25 20:25 | completed           | SF   | 1.8              | 8.0         | 24.5
2       | 2        | 11        | 2025-08-26 14:10 | 2025-08-26 14:15 | 2025-08-26 14:30 | completed           | SF   | 1.0              | 5.0         | 12.0
3       | 2        | 11        | 2025-08-26 20:05 | 2025-08-26 20:30 | NULL             | cancelled_by_driver | SF   | 2.0              | 0.0         | 0.0
4       | 3        | 12        | 2025-08-27 02:30 | NULL             | NULL             | cancelled_by_rider  | NYC  | 1.0              | 0.0         | 0.0
5       | 3        | 10        | 2025-08-30 20:15 | 2025-08-30 20:25 | 2025-08-30 20:45 | completed           | NYC  | 1.7              | 6.5         | 18.0
6       | 1        | 12        | 2025-08-31 14:05 | 2025-08-31 14:10 | 2025-08-31 14:30 | completed           | SF   | 1.0              | 7.0         | 16.0
7       | 4        | 10        | 2025-08-31 20:05 | 2025-08-31 20:20 | 2025-08-31 20:40 | completed           | SF   | 1.6              | 9.0         | 22.0
8       | 4        | 11        | 2025-09-01 02:10 | NULL             | NULL             | driver_no_show      | SF   | 1.0              | 0.0         | 0.0

payments
trip_id | amount_usd | method | card_hash | charge_time      | success
1       | 24.5       | card   | abc123    | 2025-08-25 20:26 | true
2       | 12.0       | card   | xyz777    | 2025-08-26 14:31 | true
3       | 0.0        | card   | xyz777    | 2025-08-26 20:31 | false
4       | 0.0        | card   | qwe555    | 2025-08-27 02:32 | false
5       | 18.0       | card   | abc123    | 2025-08-30 20:46 | true
6       | 16.0       | card   | abc123    | 2025-08-31 14:31 | true
7       | 22.0       | cash   | NULL      | 2025-08-31 20:41 | false
8       | 0.0        | card   | xyz777    | 2025-09-01 02:12 | false

devices
user_type | user_id | device_id
rider     | 1       | devA
rider     | 2       | devB
rider     | 3       | devC
rider     | 4       | devB
driver    | 10      | devX
driver    | 11      | devY
driver    | 12      | devZ
Write SQL answers for A–D and a Pandas answer for E. Illustrative solution sketches for each task follow the list.

A) For the last 7 days (2025-08-26 to 2025-09-01 inclusive), compute per city: total requests, completed trips, and completion rate = completed / requests. Count a request if a row exists in trips (any status). Order by completion rate ascending. Handle NULL times robustly and ensure date filtering uses request_time in UTC.

B) For each driver, over the last 7 days, compute the ratio: avg surge during 20:00–21:59 divided by avg surge during 14:00–15:59 on their completed trips, within the same city-day buckets. Return drivers where both windows have at least 1 completed trip and the ratio > 1.5. Include driver_id, city, counts per window, both averages, and the ratio.

C) Over the last 7 days, compute the median pickup wait (pickup_time - request_time) per city for completed trips, after excluding trips above the city-specific 95th percentile wait. Use window functions to compute the percentile cutoff and the median on the truncated set.

D) Identify likely duplicate rider accounts in the last 30 days: output rider_id_a, rider_id_b (a < b), evidence_type ('device' or 'card'), evidence_value (device_id or card_hash), and first_seen_time. A pair qualifies if the riders share either the same device_id in devices or the same payments.card_hash used on trips by different rider_ids. Exclude NULL evidence values. For 'card', join payments→trips to map card_hash to rider_id.

E) Pandas: Given a DataFrame trips_df of trips, compute 7-day new-user retention by cohort for riders whose first completed trip date is between 2025-08-25 and 2025-08-31. A rider is retained if they have ≥1 additional completed trip with dropoff_time within 7 days (inclusive) after their first completed trip. Return a DataFrame with cohort_date, n_new, n_retained, and retention_rate, sorted by cohort_date.
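A — a minimal sketch, assuming a PostgreSQL-style dialect (FILTER clause, ::numeric cast) and that request_time is already stored as a UTC timestamp:

-- A: per-city requests, completions, and completion rate, 2025-08-26 through 2025-09-01 (UTC).
SELECT
    city,
    COUNT(*)                                               AS total_requests,
    COUNT(*) FILTER (WHERE status = 'completed')           AS completed_trips,
    COUNT(*) FILTER (WHERE status = 'completed')::numeric
        / NULLIF(COUNT(*), 0)                              AS completion_rate
FROM trips
WHERE request_time >= TIMESTAMP '2025-08-26 00:00:00'
  AND request_time <  TIMESTAMP '2025-09-02 00:00:00'      -- keeps all of 2025-09-01
GROUP BY city
ORDER BY completion_rate ASC;

Counting on status alone keeps rows with NULL pickup_time or dropoff_time (cancellations, no-shows) in the request totals, and NULLIF guards against a zero denominator.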
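B — one reading of the task, treating a "city-day bucket" as (driver_id, city, request date) and assigning the time window from the hour of request_time; PostgreSQL syntax assumed:

-- B: evening (20:00-21:59) vs. afternoon (14:00-15:59) surge ratio per driver/city/day bucket.
WITH windowed AS (
    SELECT
        driver_id,
        city,
        request_time::date AS trip_day,
        surge_multiplier,
        CASE
            WHEN EXTRACT(HOUR FROM request_time) IN (20, 21) THEN 'evening'
            WHEN EXTRACT(HOUR FROM request_time) IN (14, 15) THEN 'afternoon'
        END AS win
    FROM trips
    WHERE status = 'completed'
      AND request_time >= TIMESTAMP '2025-08-26 00:00:00'
      AND request_time <  TIMESTAMP '2025-09-02 00:00:00'
),
agg AS (
    SELECT
        driver_id,
        city,
        trip_day,
        COUNT(*) FILTER (WHERE win = 'evening')                 AS n_evening,
        COUNT(*) FILTER (WHERE win = 'afternoon')               AS n_afternoon,
        AVG(surge_multiplier) FILTER (WHERE win = 'evening')    AS avg_surge_evening,
        AVG(surge_multiplier) FILTER (WHERE win = 'afternoon')  AS avg_surge_afternoon
    FROM windowed
    WHERE win IS NOT NULL
    GROUP BY driver_id, city, trip_day
)
SELECT
    driver_id,
    city,
    trip_day,
    n_evening,
    n_afternoon,
    avg_surge_evening,
    avg_surge_afternoon,
    avg_surge_evening / avg_surge_afternoon AS surge_ratio
FROM agg
WHERE n_evening >= 1
  AND n_afternoon >= 1
  AND avg_surge_evening / avg_surge_afternoon > 1.5;

If the intended output is one row per driver and city rather than per day, the same query works with trip_day dropped from the grouping and select list.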
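C — a sketch in PostgreSQL, where PERCENTILE_CONT is an ordered-set aggregate rather than a window function; the city-specific cutoff is therefore taken from the CUME_DIST() window function, keeping waits at or below the 95th percentile:

-- C: median pickup wait (minutes) per city after trimming the top 5% of waits.
WITH waits AS (
    SELECT
        city,
        EXTRACT(EPOCH FROM (pickup_time - request_time)) / 60.0 AS wait_min
    FROM trips
    WHERE status = 'completed'
      AND pickup_time IS NOT NULL                  -- guard against NULL times
      AND request_time >= TIMESTAMP '2025-08-26 00:00:00'
      AND request_time <  TIMESTAMP '2025-09-02 00:00:00'
),
ranked AS (
    SELECT
        city,
        wait_min,
        CUME_DIST() OVER (PARTITION BY city ORDER BY wait_min) AS pct
    FROM waits
)
SELECT
    city,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY wait_min) AS median_wait_min
FROM ranked
WHERE pct <= 0.95                                   -- drop waits above the city's 95th percentile
GROUP BY city;

In a dialect where PERCENTILE_CONT is allowed as an analytic function over PARTITION BY city (SQL Server, Oracle, BigQuery), the cutoff can be computed directly as a window expression instead of via CUME_DIST().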
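D — a sketch, PostgreSQL assumed. devices carries no timestamp, so the 30-day window (back to 2025-08-02) is applied to card evidence via charge_time; first_seen_time is interpreted as the later signup_date for a shared device and the earliest charge_time for a shared card:

-- D: candidate duplicate rider pairs sharing a device_id or a card_hash.
WITH device_pairs AS (
    SELECT
        d1.user_id                                           AS rider_id_a,
        d2.user_id                                           AS rider_id_b,
        'device'                                             AS evidence_type,
        d1.device_id                                         AS evidence_value,
        GREATEST(r1.signup_date, r2.signup_date)::timestamp  AS first_seen_time
    FROM devices d1
    JOIN devices d2
      ON  d2.device_id = d1.device_id
      AND d2.user_type = 'rider'
      AND d2.user_id   > d1.user_id                -- enforces a < b and distinct riders
    JOIN riders r1 ON r1.rider_id = d1.user_id
    JOIN riders r2 ON r2.rider_id = d2.user_id
    WHERE d1.user_type = 'rider'
      AND d1.device_id IS NOT NULL
),
card_pairs AS (
    SELECT
        t1.rider_id                                 AS rider_id_a,
        t2.rider_id                                 AS rider_id_b,
        'card'                                      AS evidence_type,
        p1.card_hash                                AS evidence_value,
        MIN(LEAST(p1.charge_time, p2.charge_time))  AS first_seen_time
    FROM payments p1
    JOIN trips    t1 ON t1.trip_id = p1.trip_id
    JOIN payments p2 ON p2.card_hash = p1.card_hash
    JOIN trips    t2 ON t2.trip_id = p2.trip_id
    WHERE p1.card_hash IS NOT NULL
      AND t1.rider_id < t2.rider_id                -- different riders, a < b
      AND p1.charge_time >= TIMESTAMP '2025-08-02 00:00:00'
      AND p2.charge_time >= TIMESTAMP '2025-08-02 00:00:00'
    GROUP BY t1.rider_id, t2.rider_id, p1.card_hash
)
SELECT * FROM device_pairs
UNION
SELECT * FROM card_pairs
ORDER BY rider_id_a, rider_id_b, evidence_type;

On the sample data this flags riders 2 and 4 (shared device devB and shared card xyz777) and riders 1 and 3 (shared card abc123).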
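E — a sketch, assuming trips_df carries rider_id, status, and dropoff_time (datetime-like), and that the cohort date is the calendar date of the first completed trip's dropoff_time; the helper name seven_day_retention is made up for illustration:

# E: 7-day new-rider retention by cohort (hypothetical helper name).
import pandas as pd

def seven_day_retention(trips_df: pd.DataFrame) -> pd.DataFrame:
    df = trips_df.copy()
    df["dropoff_time"] = pd.to_datetime(df["dropoff_time"])
    completed = df[df["status"] == "completed"]

    # First completed trip per rider defines the cohort date.
    first = (completed.groupby("rider_id")["dropoff_time"]
                      .min()
                      .rename("first_dropoff")
                      .reset_index())
    first["cohort_date"] = first["first_dropoff"].dt.date

    # Keep only cohorts whose first completed trip falls in the requested window.
    in_window = ((first["cohort_date"] >= pd.Timestamp("2025-08-25").date()) &
                 (first["cohort_date"] <= pd.Timestamp("2025-08-31").date()))
    cohorts = first[in_window]

    # Retained = at least one later completed trip ending within 7 days (inclusive).
    merged = completed.merge(cohorts, on="rider_id", how="inner")
    retained_mask = ((merged["dropoff_time"] > merged["first_dropoff"]) &
                     (merged["dropoff_time"] <= merged["first_dropoff"] + pd.Timedelta(days=7)))
    retained_ids = merged.loc[retained_mask, "rider_id"].unique()

    out = (cohorts.assign(retained=cohorts["rider_id"].isin(retained_ids))
                  .groupby("cohort_date")
                  .agg(n_new=("rider_id", "nunique"),
                       n_retained=("retained", "sum"))
                  .reset_index())
    out["retention_rate"] = out["n_retained"] / out["n_new"]
    return out.sort_values("cohort_date").reset_index(drop=True)

If "first completed trip date" should instead be keyed on request_time or pickup_time, only the column used to build first_dropoff changes.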