Analyze NYC taxi trips efficiently over last 7 days

Q: Analyze NYC taxi trips efficiently over last 7 days

This question evaluates proficiency in time-series filtering, relational joins, grouped aggregations (median and high-percentile calculations), numeric stability of percentile methods, and performance-aware vectorized data manipulation using SQL or pandas.

Q: How do I approach Data Manipulation (SQL/Python) interview questions?

Data Manipulation (SQL/Python) questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master data manipulation (sql/python) interviews.

Question

Use today = 2025-09-01. Consider NYC taxi trip data over the last 7 days inclusive (2025-08-26 to 2025-09-01, America/New_York). You receive two datasets and must write a fast, vectorized analysis (no Python for-loops over rows). Data schema and tiny samples:

trips(id, taxi_id, pickup_ts, dropoff_ts, pickup_zone_id, dropoff_zone_id, distance_miles, fare_amount) 1 | 101 | 2025-08-26 08:15 | 2025-08-26 08:45 | 1 | 3 | 6.0 | 18.50 2 | 102 | 2025-08-26 00:20 | 2025-08-26 00:50 | 2 | 1 | 4.0 | 14.00 3 | 101 | 2025-08-28 01:10 | 2025-08-28 01:40 | 1 | 1 | 3.0 | 12.00 4 | 103 | 2025-08-30 17:05 | 2025-08-30 17:25 | 4 | 2 | 2.5 | 9.50 5 | 104 | 2025-09-01 02:30 | 2025-09-01 03:20 | 1 | 4 | 10.0 | 30.00 6 | 102 | 2025-08-31 23:50 | 2025-09-01 00:10 | 3 | 3 | 5.0 | 16.00

zones(zone_id, borough) 1 | Manhattan 2 | Brooklyn 3 | Queens 4 | Bronx

Tasks:

After joining trips with zones on pickup_zone_id, compute per (borough, hour_of_day from pickup_ts) the median trip speed in mph, where speed = distance_miles / duration_hours. Filter trips to 1 ≤ duration_minutes ≤ 120 and 1 ≤ speed ≤ 80. Return the top 3 (borough, hour) pairs by median speed; break ties by borough asc, then hour asc. Report the exact (borough, hour, median_speed_mph) triples. 2) For trips with pickup borough = 'Manhattan' and pickup time between 00:00 and 05:00 inclusive, identify the 3 taxi_id with the largest 95th percentile of trip duration (minutes) over the same date range; break ties by taxi_id asc. Clearly define how you compute the 95th percentile (e.g., pandas/numpy method) and use a stable, vectorized approach. 3) Provide pandas code (or SQL) that runs in O(n log n) or better due to grouping/quantile operations, avoids per-row loops, and uses: parsed datetime dtypes; one-to-many join performed once; categorical dtype for borough; appropriate indexing on pickup_ts for time filtering. 4) Briefly justify two memory/performance optimizations you employ (e.g., downcasting floats/ints, using groupby-agg with quantile in a single pass, avoiding intermediate copies).

Analyze NYC taxi trips efficiently over last 7 days

Overview

Comments (0)