Refactor SQL into an aggregated report
Company: DoorDash
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Technical Screen
You are given the following Postgres schema and small sample data for a food-delivery platform.
Schema:
- orders(order_id INT PRIMARY KEY, city TEXT, created_at TIMESTAMPTZ, completed_at TIMESTAMPTZ NULL, status TEXT CHECK (status IN ('created','accepted','picked_up','completed','cancelled')), courier_id INT, customer_id INT, subtotal NUMERIC(10,2), tip NUMERIC(10,2), cancel_reason TEXT NULL)
- couriers(courier_id INT PRIMARY KEY, region TEXT, is_online BOOLEAN, active_start TIMESTAMPTZ, active_end TIMESTAMPTZ)
Sample rows (minimal):
orders
order_id | city | created_at | completed_at | status | courier_id | subtotal | tip | cancel_reason
1 | LA | 2025-08-31 18:02:00+00 | 2025-08-31 18:35:00+00 | completed | 101 | 25.00 | 5.00| NULL
2 | LA | 2025-08-31 18:10:00+00 | NULL | cancelled | 102 | 12.00 | 0.00| no_courier
3 | LA | 2025-09-01 00:05:00+00 | 2025-09-01 00:25:00+00 | completed | 101 | 30.00 | 3.00| NULL
4 | SF | 2025-08-31 19:40:00+00 | 2025-08-31 20:05:00+00 | completed | 103 | 18.00 | 2.00| NULL
couriers
courier_id | region | is_online | active_start | active_end
101 | LA | true | 2025-08-31 17:45:00+00 | 2025-08-31 21:00:00+00
102 | LA | true | 2025-08-31 18:00:00+00 | 2025-08-31 19:00:00+00
103 | SF | true | 2025-08-31 19:00:00+00 | 2025-08-31 22:00:00+00
An analyst has this existing query that lists raw orders:
SELECT o.order_id, o.city, o.created_at, o.completed_at, o.status, o.subtotal, o.tip, o.cancel_reason
FROM orders o
WHERE o.created_at >= TIMESTAMPTZ '2025-08-30 00:00:00+00' AND o.city IN ('LA','SF');
Without rewriting from scratch, precisely describe how you would transform the above into an aggregated DAILY city-level report for LA and SF, using PostgreSQL 14 features (e.g., percentile_cont), with the following business rules and outputs. Then provide the final SQL:
- Local day should be America/Los_Angeles; bucket by local date derived from created_at. Attribute completions that occur after midnight back to the order's creation local date.
- Deduplicate any accidental duplicate rows by order_id, keeping the earliest created_at per order_id.
- Metrics per city, per local day: created_orders, completed_orders, completion_rate, cancellations_no_courier_count and rate, median_fulfillment_minutes, p95_fulfillment_minutes, GMV (sum of subtotal+tip for completed orders).
- Exclude city-days with created_orders < 20; order results by local_day, city.
- Ensure safe divisions (no divide-by-zero) and round rates/percentiles to 2 decimals.
Explain each transformation step briefly (CTEs, timezone conversion, deduping, grouping, ordered-set aggregates), then show the final query.
Quick Answer: This question evaluates proficiency in SQL-based data transformation and analytical aggregation, covering timezone-aware local date bucketing, deduplication, percentile/ordered-set aggregates, safe rate calculations, and other PostgreSQL 14 features used for reporting.