How do I approach Data Manipulation (SQL/Python) interview questions?

Data Manipulation (SQL/Python) questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master data manipulation (sql/python) interviews.

What difficulty level is this interview question?

This is a Medium difficulty Data Manipulation (SQL/Python) question, commonly asked during Technical Screen rounds at DoorDash.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at DoorDash during technical interviews.

Refactor SQL into an aggregated report | DoorDash Interview Question

Q: Refactor SQL into an aggregated report

This question evaluates proficiency in SQL-based data transformation and analytical aggregation, covering timezone-aware local date bucketing, deduplication, percentile/ordered-set aggregates, safe rate calculations, and other PostgreSQL 14 features used for reporting.

You are given the following Postgres schema and small sample data for a food-delivery platform.

Schema:

orders(order_id INT PRIMARY KEY, city TEXT, created_at TIMESTAMPTZ, completed_at TIMESTAMPTZ NULL, status TEXT CHECK (status IN ('created','accepted','picked_up','completed','cancelled')), courier_id INT, customer_id INT, subtotal NUMERIC(10,2), tip NUMERIC(10,2), cancel_reason TEXT NULL)
couriers(courier_id INT PRIMARY KEY, region TEXT, is_online BOOLEAN, active_start TIMESTAMPTZ, active_end TIMESTAMPTZ)

Sample rows (minimal): orders order_id | city | created_at | completed_at | status | courier_id | subtotal | tip | cancel_reason 1 | LA | 2025-08-31 18:02:00+00 | 2025-08-31 18:35:00+00 | completed | 101 | 25.00 | 5.00| NULL 2 | LA | 2025-08-31 18:10:00+00 | NULL | cancelled | 102 | 12.00 | 0.00| no_courier 3 | LA | 2025-09-01 00:05:00+00 | 2025-09-01 00:25:00+00 | completed | 101 | 30.00 | 3.00| NULL 4 | SF | 2025-08-31 19:40:00+00 | 2025-08-31 20:05:00+00 | completed | 103 | 18.00 | 2.00| NULL

couriers courier_id | region | is_online | active_start | active_end 101 | LA | true | 2025-08-31 17:45:00+00 | 2025-08-31 21:00:00+00 102 | LA | true | 2025-08-31 18:00:00+00 | 2025-08-31 19:00:00+00 103 | SF | true | 2025-08-31 19:00:00+00 | 2025-08-31 22:00:00+00

An analyst has this existing query that lists raw orders: SELECT o.order_id, o.city, o.created_at, o.completed_at, o.status, o.subtotal, o.tip, o.cancel_reason FROM orders o WHERE o.created_at >= TIMESTAMPTZ '2025-08-30 00:00:00+00' AND o.city IN ('LA','SF');

Without rewriting from scratch, precisely describe how you would transform the above into an aggregated DAILY city-level report for LA and SF, using PostgreSQL 14 features (e.g., percentile_cont), with the following business rules and outputs. Then provide the final SQL:

Local day should be America/Los_Angeles; bucket by local date derived from created_at. Attribute completions that occur after midnight back to the order's creation local date.
Deduplicate any accidental duplicate rows by order_id, keeping the earliest created_at per order_id.
Metrics per city, per local day: created_orders, completed_orders, completion_rate, cancellations_no_courier_count and rate, median_fulfillment_minutes, p95_fulfillment_minutes, GMV (sum of subtotal+tip for completed orders).
Exclude city-days with created_orders < 20; order results by local_day, city.
Ensure safe divisions (no divide-by-zero) and round rates/percentiles to 2 decimals. Explain each transformation step briefly (CTEs, timezone conversion, deduping, grouping, ordered-set aggregates), then show the final query.

You are given the following Postgres schema and small sample data for a food-delivery platform.

Schema:

orders(order_id INT PRIMARY KEY, city TEXT, created_at TIMESTAMPTZ, completed_at TIMESTAMPTZ NULL, status TEXT CHECK (status IN ('created','accepted','picked_up','completed','cancelled')), courier_id INT, customer_id INT, subtotal NUMERIC(10,2), tip NUMERIC(10,2), cancel_reason TEXT NULL)
couriers(courier_id INT PRIMARY KEY, region TEXT, is_online BOOLEAN, active_start TIMESTAMPTZ, active_end TIMESTAMPTZ)

Local day should be America/Los_Angeles; bucket by local date derived from created_at. Attribute completions that occur after midnight back to the order's creation local date.
Deduplicate any accidental duplicate rows by order_id, keeping the earliest created_at per order_id.
Metrics per city, per local day: created_orders, completed_orders, completion_rate, cancellations_no_courier_count and rate, median_fulfillment_minutes, p95_fulfillment_minutes, GMV (sum of subtotal+tip for completed orders).
Exclude city-days with created_orders < 20; order results by local_day, city.
Ensure safe divisions (no divide-by-zero) and round rates/percentiles to 2 decimals. Explain each transformation step briefly (CTEs, timezone conversion, deduping, grouping, ordered-set aggregates), then show the final query.

Refactor SQL into an aggregated report

Quick Overview

Comments (0)

Refactor SQL into an aggregated report

Quick Overview

Comments (0)