##### Question You work on a rideshare / ride-hailing product that connects **drivers** and **riders**, with a focus on **airport pickups**. Using SQL, answer the questions below. Assume all timestamps are stored in UTC and the analysis window is the **last 28 days ending at `:as_of_ts` (inclusive)** — older variants of this question use a rolling **last 30 days**, so use whichever window your interviewer specifies. State any additional assumptions you make. ### Tables **drivers** - `driver_id` BIGINT (PK) - `home_city_id` BIGINT - `is_active` BOOLEAN - `activated_at` TIMESTAMPTZ **riders** - `rider_id` BIGINT (PK) - `is_active` BOOLEAN - `is_fraud` BOOLEAN - `first_trip_at` TIMESTAMPTZ **airports** - `airport_id` BIGINT (PK) - `airport_code` TEXT — e.g., `'SFO'` - `country_code` TEXT — e.g., `'US'` **trips** (one row per trip request) - `trip_id` BIGINT (PK) - `driver_id` BIGINT (nullable until dispatch) - `rider_id` BIGINT - `request_at` TIMESTAMPTZ - `airport_id` BIGINT (nullable; non-null means airport pickup) - `status` TEXT — one of `('completed','canceled','expired')` - `canceled_by` TEXT — nullable; if canceled, one of `('driver','rider','system')` - `canceled_at` TIMESTAMPTZ (nullable) ### Q1: Top drivers by airport completions Return the **top 10 drivers** by number of **completed airport-pickup trips** in the window, restricted to: - airports in the US (`airports.country_code = 'US'`) - active drivers only (`drivers.is_active = TRUE`) - exclude fraudulent riders (rows where `riders.is_fraud = TRUE`) Output columns: - `driver_id` - `completed_trips` - `total_airport_requests` — count of all airport trip requests, regardless of status - `completion_rate` = `completed_trips / total_airport_requests` Tie-breakers (in order): higher `completed_trips`, then higher `completion_rate`, then lower `driver_id`. ### Q2: Cancellation rate by airport and canceler For the same window and US-airport restriction, compute cancellation rates **by airport**, split by who canceled. For each `airport_code`, return: - `total_requests` - `driver_cancel_rate` = driver-canceled trips / total_requests - `rider_cancel_rate` = rider-canceled trips / total_requests - `system_cancel_rate` = system-canceled trips / total_requests - `overall_cancel_rate` = all canceled trips / total_requests Include only airports with `total_requests >= 100`. ### Q3: Top riders by cancellations Return the **top 10 riders** with the highest number of **canceled** airport-pickup trips in the window (US airports). Output columns: - `rider_id` - `canceled_trips` > Note: Many candidates make mistakes around filtering joined tables (conditions placed in `WHERE` vs `ON`). Write queries that handle nulls and outer joins correctly, and that preserve airports/drivers with requests even when some joined attributes are missing. Be explicit about which filters belong in `ON` versus `WHERE`.

This onsite Data Scientist SQL question (data manipulation) uses a rideshare/airport-pickup dataset to test joins with correct ON-vs-WHERE semantics, time-window filtering, conditional aggregation for counts and rates over a shared denominator, null and divide-by-zero handling, and deterministic tie-breaking. Candidates rank top drivers by completed airport trips, compute per-airport cancellation rates split by who canceled, and rank top riders by cancellations.

How do I approach Data Manipulation (SQL/Python) interview questions?

Data Manipulation (SQL/Python) questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master data manipulation (sql/python) interviews.

What difficulty level is this interview question?

This is a easy difficulty Data Manipulation (SQL/Python) question, commonly asked during Onsite rounds at PayPal.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at PayPal during technical interviews.

Write SQL for top drivers and cancellation rates

Question

You work on a rideshare / ride-hailing product that connects drivers and riders, with a focus on airport pickups. Using SQL, answer the questions below.

Assume all timestamps are stored in UTC and the analysis window is the last 28 days ending at :as_of_ts (inclusive) — older variants of this question use a rolling last 30 days, so use whichever window your interviewer specifies. State any additional assumptions you make.

Tables

drivers

driver_id BIGINT (PK)
home_city_id BIGINT
is_active BOOLEAN
activated_at TIMESTAMPTZ

riders

rider_id BIGINT (PK)
is_active BOOLEAN
is_fraud BOOLEAN
first_trip_at TIMESTAMPTZ

airports

airport_id BIGINT (PK)
airport_code TEXT — e.g., 'SFO'
country_code TEXT — e.g., 'US'

trips (one row per trip request)

trip_id BIGINT (PK)
driver_id BIGINT (nullable until dispatch)
rider_id BIGINT
request_at TIMESTAMPTZ
airport_id BIGINT (nullable; non-null means airport pickup)
status TEXT — one of ('completed','canceled','expired')
canceled_by TEXT — nullable; if canceled, one of ('driver','rider','system')
canceled_at TIMESTAMPTZ (nullable)

Q1: Top drivers by airport completions

Return the top 10 drivers by number of completed airport-pickup trips in the window, restricted to:

airports in the US ( airports.country_code = 'US' )
active drivers only ( drivers.is_active = TRUE )
exclude fraudulent riders (rows where riders.is_fraud = TRUE )

Output columns:

driver_id
completed_trips
total_airport_requests — count of all airport trip requests, regardless of status
completion_rate = completed_trips / total_airport_requests

Tie-breakers (in order): higher completed_trips, then higher completion_rate, then lower driver_id.

Q2: Cancellation rate by airport and canceler

For the same window and US-airport restriction, compute cancellation rates by airport, split by who canceled. For each airport_code, return:

total_requests
driver_cancel_rate = driver-canceled trips / total_requests
rider_cancel_rate = rider-canceled trips / total_requests
system_cancel_rate = system-canceled trips / total_requests
overall_cancel_rate = all canceled trips / total_requests

Include only airports with total_requests >= 100.

Q3: Top riders by cancellations

Return the top 10 riders with the highest number of canceled airport-pickup trips in the window (US airports).

Output columns:

rider_id
canceled_trips

Note: Many candidates make mistakes around filtering joined tables (conditions placed in WHERE vs ON). Write queries that handle nulls and outer joins correctly, and that preserve airports/drivers with requests even when some joined attributes are missing. Be explicit about which filters belong in ON versus WHERE.

Question

You work on a rideshare / ride-hailing product that connects drivers and riders, with a focus on airport pickups. Using SQL, answer the questions below.

Tables

drivers

driver_id BIGINT (PK)
home_city_id BIGINT
is_active BOOLEAN
activated_at TIMESTAMPTZ

riders

rider_id BIGINT (PK)
is_active BOOLEAN
is_fraud BOOLEAN
first_trip_at TIMESTAMPTZ

airports

airport_id BIGINT (PK)
airport_code TEXT — e.g., 'SFO'
country_code TEXT — e.g., 'US'

trips (one row per trip request)

trip_id BIGINT (PK)
driver_id BIGINT (nullable until dispatch)
rider_id BIGINT
request_at TIMESTAMPTZ
airport_id BIGINT (nullable; non-null means airport pickup)
status TEXT — one of ('completed','canceled','expired')
canceled_by TEXT — nullable; if canceled, one of ('driver','rider','system')
canceled_at TIMESTAMPTZ (nullable)

Q1: Top drivers by airport completions

Return the top 10 drivers by number of completed airport-pickup trips in the window, restricted to:

airports in the US ( airports.country_code = 'US' )
active drivers only ( drivers.is_active = TRUE )
exclude fraudulent riders (rows where riders.is_fraud = TRUE )

Output columns:

driver_id
completed_trips
total_airport_requests — count of all airport trip requests, regardless of status
completion_rate = completed_trips / total_airport_requests

Tie-breakers (in order): higher completed_trips, then higher completion_rate, then lower driver_id.

Q2: Cancellation rate by airport and canceler

For the same window and US-airport restriction, compute cancellation rates by airport, split by who canceled. For each airport_code, return:

total_requests
driver_cancel_rate = driver-canceled trips / total_requests
rider_cancel_rate = rider-canceled trips / total_requests
system_cancel_rate = system-canceled trips / total_requests
overall_cancel_rate = all canceled trips / total_requests

Include only airports with total_requests >= 100.

Q3: Top riders by cancellations

Return the top 10 riders with the highest number of canceled airport-pickup trips in the window (US airports).

Output columns:

rider_id
canceled_trips

Note: Many candidates make mistakes around filtering joined tables (conditions placed in WHERE vs ON). Write queries that handle nulls and outer joins correctly, and that preserve airports/drivers with requests even when some joined attributes are missing. Be explicit about which filters belong in ON versus WHERE.

Write SQL for top drivers and cancellation rates

Quick Overview

Question

Tables

Q1: Top drivers by airport completions

Q2: Cancellation rate by airport and canceler

Q3: Top riders by cancellations

Solution

Submit Your Answer to Earn 20XP

Write SQL for top drivers and cancellation rates

Quick Overview

Question

Tables

Q1: Top drivers by airport completions

Q2: Cancellation rate by airport and canceler

Q3: Top riders by cancellations

Solution

Submit Your Answer to Earn 20XP