PracHub

Quick Overview

This question evaluates proficiency in time-series filtering, relational joins, grouped aggregations (median and high-percentile calculations), numeric stability of percentile methods, and performance-aware vectorized data manipulation using SQL or pandas.

Analyze NYC taxi trips efficiently over the last 7 days

Company: Two Sigma

Role: Data Scientist

Category: Data Manipulation (SQL/Python)

Difficulty: Medium

Interview Round: Take-home Project

Use today = 2025-09-01. Consider NYC taxi trip data over the last 7 days inclusive (2025-08-26 to 2025-09-01, America/New_York). You receive two datasets and must write a fast, vectorized analysis (no Python for-loops over rows).

Data schema and tiny samples:

trips(id, taxi_id, pickup_ts, dropoff_ts, pickup_zone_id, dropoff_zone_id, distance_miles, fare_amount)

  1 | 101 | 2025-08-26 08:15 | 2025-08-26 08:45 | 1 | 3 |  6.0 | 18.50
  2 | 102 | 2025-08-26 00:20 | 2025-08-26 00:50 | 2 | 1 |  4.0 | 14.00
  3 | 101 | 2025-08-28 01:10 | 2025-08-28 01:40 | 1 | 1 |  3.0 | 12.00
  4 | 103 | 2025-08-30 17:05 | 2025-08-30 17:25 | 4 | 2 |  2.5 |  9.50
  5 | 104 | 2025-09-01 02:30 | 2025-09-01 03:20 | 1 | 4 | 10.0 | 30.00
  6 | 102 | 2025-08-31 23:50 | 2025-09-01 00:10 | 3 | 3 |  5.0 | 16.00

zones(zone_id, borough)

  1 | Manhattan
  2 | Brooklyn
  3 | Queens
  4 | Bronx

Tasks:

1) After joining trips with zones on pickup_zone_id, compute, per (borough, hour_of_day from pickup_ts), the median trip speed in mph, where speed = distance_miles / duration_hours. Filter trips to 1 ≤ duration_minutes ≤ 120 and 1 ≤ speed ≤ 80. Return the top 3 (borough, hour) pairs by median speed; break ties by borough ascending, then hour ascending. Report the exact (borough, hour, median_speed_mph) triples.

2) For trips with pickup borough = 'Manhattan' and pickup time between 00:00 and 05:00 inclusive, identify the 3 taxi_id values with the largest 95th percentile of trip duration (minutes) over the same date range; break ties by taxi_id ascending. Clearly define how you compute the 95th percentile (e.g., the pandas/numpy interpolation method) and use a stable, vectorized approach.

3) Provide pandas code (or SQL) that runs in O(n log n) or better, dominated by the grouping/quantile operations, avoids per-row loops, and uses: parsed datetime dtypes; a one-to-many join performed once; a categorical dtype for borough; and appropriate indexing on pickup_ts for time filtering.

4) Briefly justify two memory/performance optimizations you employ (e.g., downcasting floats/ints, computing groupby quantiles in a single pass, avoiding intermediate copies).
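Task 1 can be sketched in pandas along the following lines. This is a non-authoritative sketch built from the six sample rows above; unneeded columns (dropoff_zone_id, fare_amount) are omitted, and variable names are illustrative:

```python
import pandas as pd

# The six sample trips from the prompt, with parsed datetime dtypes.
trips = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "taxi_id": [101, 102, 101, 103, 104, 102],
    "pickup_ts": pd.to_datetime([
        "2025-08-26 08:15", "2025-08-26 00:20", "2025-08-28 01:10",
        "2025-08-30 17:05", "2025-09-01 02:30", "2025-08-31 23:50"]),
    "dropoff_ts": pd.to_datetime([
        "2025-08-26 08:45", "2025-08-26 00:50", "2025-08-28 01:40",
        "2025-08-30 17:25", "2025-09-01 03:20", "2025-09-01 00:10"]),
    "pickup_zone_id": [1, 2, 1, 4, 1, 3],
    "distance_miles": [6.0, 4.0, 3.0, 2.5, 10.0, 5.0],
})
zones = pd.DataFrame({"zone_id": [1, 2, 3, 4],
                      "borough": ["Manhattan", "Brooklyn", "Queens", "Bronx"]})

# One-to-many join performed once; borough stored as categorical.
df = trips.merge(zones, left_on="pickup_zone_id", right_on="zone_id")
df["borough"] = df["borough"].astype("category")

# Vectorized duration (minutes) and speed (mph), then the sanity filters.
dur_min = (df["dropoff_ts"] - df["pickup_ts"]).dt.total_seconds() / 60.0
df["speed_mph"] = df["distance_miles"] / (dur_min / 60.0)
df = df[dur_min.between(1, 120) & df["speed_mph"].between(1, 80)].copy()

# Median speed per (borough, pickup hour); top 3 with the stated tie-break.
df["hour"] = df["pickup_ts"].dt.hour
med = (df.groupby(["borough", "hour"], observed=True)["speed_mph"]
         .median()
         .reset_index(name="median_speed_mph"))
top3 = med.sort_values(["median_speed_mph", "borough", "hour"],
                       ascending=[False, True, True]).head(3)
print(top3)
```

On the sample rows this yields (Queens, 23, 15.0), then the two Manhattan groups tied at 12.0 mph ordered by hour (2 before 8). `observed=True` keeps the categorical groupby from materializing empty (borough, hour) combinations.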

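For Task 2, a minimal sketch, assuming the three Manhattan-pickup rows from the sample (trips 1, 3 and 5) and using pandas' default quantile definition, i.e. linear interpolation between order statistics:

```python
from datetime import time

import pandas as pd

# Manhattan pickups only (trips 1, 3, 5 from the sample); illustrative subset.
trips = pd.DataFrame({
    "taxi_id": [101, 101, 104],
    "pickup_ts": pd.to_datetime(
        ["2025-08-26 08:15", "2025-08-28 01:10", "2025-09-01 02:30"]),
    "dropoff_ts": pd.to_datetime(
        ["2025-08-26 08:45", "2025-08-28 01:40", "2025-09-01 03:20"]),
})

# Pickup time between 00:00 and 05:00 inclusive (clock-time comparison,
# so 05:00:00 passes but 05:00:01 does not).
mask = trips["pickup_ts"].dt.time.between(time(0, 0), time(5, 0))
night = trips[mask].copy()
night["dur_min"] = (night["dropoff_ts"]
                    - night["pickup_ts"]).dt.total_seconds() / 60.0

# 95th percentile per taxi in one vectorized pass; Series.quantile
# defaults to linear interpolation (numpy's 'linear' method).
p95 = night.groupby("taxi_id")["dur_min"].quantile(0.95)

# Top 3 taxis by p95 duration; ties broken by taxi_id ascending.
top3 = (p95.rename("p95_min").reset_index()
           .sort_values(["p95_min", "taxi_id"], ascending=[False, True])
           .head(3))
print(top3)
```

On this tiny subset only two taxis qualify, so the "top 3" is taxi 104 (50.0 min) followed by taxi 101 (30.0 min); with one trip per taxi the 95th percentile degenerates to the trip's own duration.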

Last updated: Mar 29, 2026
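The two optimizations asked for in Task 4 can be demonstrated concretely. The sketch below is illustrative only: the row count and random values are assumptions, not from the prompt:

```python
import numpy as np
import pandas as pd

# Synthetic frame shaped like the trips table (sizes are arbitrary).
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "pickup_zone_id": rng.integers(1, 5, n).astype("int64"),
    "fare_amount": rng.uniform(3.0, 60.0, n),          # float64 by default
    "borough": rng.choice(["Manhattan", "Brooklyn", "Queens", "Bronx"], n),
})
before = df.memory_usage(deep=True).sum()

# Optimization 1: downcast numerics -- zone ids fit in int8, and fares
# lose nothing meaningful in float32.
df["pickup_zone_id"] = pd.to_numeric(df["pickup_zone_id"], downcast="integer")
df["fare_amount"] = pd.to_numeric(df["fare_amount"], downcast="float")

# Optimization 2: categorical dtype for the low-cardinality borough
# column -- four labels stored once plus small integer codes per row,
# instead of one Python string object per row.
df["borough"] = df["borough"].astype("category")

after = df.memory_usage(deep=True).sum()
print(f"{before:,} -> {after:,} bytes")
```

The byte counts printed will vary by pandas version, but the post-conversion footprint is a fraction of the original, which also speeds up the groupby passes in Tasks 1 and 2.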

