PracHub

Quick Overview

This question evaluates proficiency in time-series filtering, relational joins, grouped aggregations (median and high-percentile calculations), numeric stability of percentile methods, and performance-aware vectorized data manipulation using SQL or pandas.

Analyze NYC taxi trips efficiently over the last 7 days

Company: Two Sigma

Role: Data Scientist

Category: Data Manipulation (SQL/Python)

Difficulty: Medium

Interview Round: Take-home Project

Use today = 2025-09-01. Consider NYC taxi trip data over the last 7 days inclusive (2025-08-26 to 2025-09-01, America/New_York). You receive two datasets and must write a fast, vectorized analysis (no Python for-loops over rows).

Data schema and tiny samples:

trips(id, taxi_id, pickup_ts, dropoff_ts, pickup_zone_id, dropoff_zone_id, distance_miles, fare_amount)

  1 | 101 | 2025-08-26 08:15 | 2025-08-26 08:45 | 1 | 3 |  6.0 | 18.50
  2 | 102 | 2025-08-26 00:20 | 2025-08-26 00:50 | 2 | 1 |  4.0 | 14.00
  3 | 101 | 2025-08-28 01:10 | 2025-08-28 01:40 | 1 | 1 |  3.0 | 12.00
  4 | 103 | 2025-08-30 17:05 | 2025-08-30 17:25 | 4 | 2 |  2.5 |  9.50
  5 | 104 | 2025-09-01 02:30 | 2025-09-01 03:20 | 1 | 4 | 10.0 | 30.00
  6 | 102 | 2025-08-31 23:50 | 2025-09-01 00:10 | 3 | 3 |  5.0 | 16.00

zones(zone_id, borough)

  1 | Manhattan
  2 | Brooklyn
  3 | Queens
  4 | Bronx

Tasks:

1) After joining trips with zones on pickup_zone_id, compute, per (borough, hour_of_day from pickup_ts), the median trip speed in mph, where speed = distance_miles / duration_hours. Filter trips to 1 ≤ duration_minutes ≤ 120 and 1 ≤ speed ≤ 80. Return the top 3 (borough, hour) pairs by median speed; break ties by borough ascending, then hour ascending. Report the exact (borough, hour, median_speed_mph) triples.

2) For trips with pickup borough = 'Manhattan' and pickup time between 00:00 and 05:00 inclusive, identify the 3 taxi_id values with the largest 95th percentile of trip duration (minutes) over the same date range; break ties by taxi_id ascending. Clearly define how you compute the 95th percentile (e.g., the pandas/numpy interpolation method) and use a stable, vectorized approach.

3) Provide pandas code (or SQL) that runs in O(n log n) or better, dominated by the grouping/quantile operations, avoids per-row loops, and uses: parsed datetime dtypes; a one-to-many join performed once; a categorical dtype for borough; and appropriate indexing on pickup_ts for time filtering.

4) Briefly justify two memory/performance optimizations you employ (e.g., downcasting floats/ints, computing groupby quantiles in a single pass, avoiding intermediate copies).
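Task 1 can be sketched in pandas along the following lines. This is a non-authoritative sketch built from the six sample rows above; unneeded columns (dropoff_zone_id, fare_amount) are omitted, and variable names are illustrative:

```python
import pandas as pd

# The six sample trips from the prompt, with parsed datetime dtypes.
trips = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "taxi_id": [101, 102, 101, 103, 104, 102],
    "pickup_ts": pd.to_datetime([
        "2025-08-26 08:15", "2025-08-26 00:20", "2025-08-28 01:10",
        "2025-08-30 17:05", "2025-09-01 02:30", "2025-08-31 23:50"]),
    "dropoff_ts": pd.to_datetime([
        "2025-08-26 08:45", "2025-08-26 00:50", "2025-08-28 01:40",
        "2025-08-30 17:25", "2025-09-01 03:20", "2025-09-01 00:10"]),
    "pickup_zone_id": [1, 2, 1, 4, 1, 3],
    "distance_miles": [6.0, 4.0, 3.0, 2.5, 10.0, 5.0],
})
zones = pd.DataFrame({"zone_id": [1, 2, 3, 4],
                      "borough": ["Manhattan", "Brooklyn", "Queens", "Bronx"]})

# One-to-many join performed once; borough stored as categorical.
df = trips.merge(zones, left_on="pickup_zone_id", right_on="zone_id")
df["borough"] = df["borough"].astype("category")

# Vectorized duration (minutes) and speed (mph), then the sanity filters.
dur_min = (df["dropoff_ts"] - df["pickup_ts"]).dt.total_seconds() / 60.0
df["speed_mph"] = df["distance_miles"] / (dur_min / 60.0)
df = df[dur_min.between(1, 120) & df["speed_mph"].between(1, 80)].copy()

# Median speed per (borough, pickup hour); top 3 with the stated tie-break.
df["hour"] = df["pickup_ts"].dt.hour
med = (df.groupby(["borough", "hour"], observed=True)["speed_mph"]
         .median()
         .reset_index(name="median_speed_mph"))
top3 = med.sort_values(["median_speed_mph", "borough", "hour"],
                       ascending=[False, True, True]).head(3)
print(top3)
```

On the sample rows this yields (Queens, 23, 15.0), then the two Manhattan groups tied at 12.0 mph ordered by hour (2 before 8). `observed=True` keeps the categorical groupby from materializing empty (borough, hour) combinations.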

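For Task 2, a minimal sketch, assuming the three Manhattan-pickup rows from the sample (trips 1, 3 and 5) and using pandas' default quantile definition, i.e. linear interpolation between order statistics:

```python
from datetime import time

import pandas as pd

# Manhattan pickups only (trips 1, 3, 5 from the sample); illustrative subset.
trips = pd.DataFrame({
    "taxi_id": [101, 101, 104],
    "pickup_ts": pd.to_datetime(
        ["2025-08-26 08:15", "2025-08-28 01:10", "2025-09-01 02:30"]),
    "dropoff_ts": pd.to_datetime(
        ["2025-08-26 08:45", "2025-08-28 01:40", "2025-09-01 03:20"]),
})

# Pickup time between 00:00 and 05:00 inclusive (clock-time comparison,
# so 05:00:00 passes but 05:00:01 does not).
mask = trips["pickup_ts"].dt.time.between(time(0, 0), time(5, 0))
night = trips[mask].copy()
night["dur_min"] = (night["dropoff_ts"]
                    - night["pickup_ts"]).dt.total_seconds() / 60.0

# 95th percentile per taxi in one vectorized pass; Series.quantile
# defaults to linear interpolation (numpy's 'linear' method).
p95 = night.groupby("taxi_id")["dur_min"].quantile(0.95)

# Top 3 taxis by p95 duration; ties broken by taxi_id ascending.
top3 = (p95.rename("p95_min").reset_index()
           .sort_values(["p95_min", "taxi_id"], ascending=[False, True])
           .head(3))
print(top3)
```

On this tiny subset only two taxis qualify, so the "top 3" is taxi 104 (50.0 min) followed by taxi 101 (30.0 min); with one trip per taxi the 95th percentile degenerates to the trip's own duration.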

Last updated: Mar 29, 2026
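The two optimizations asked for in Task 4 can be demonstrated concretely. The sketch below is illustrative only: the row count and random values are assumptions, not from the prompt:

```python
import numpy as np
import pandas as pd

# Synthetic frame shaped like the trips table (sizes are arbitrary).
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "pickup_zone_id": rng.integers(1, 5, n).astype("int64"),
    "fare_amount": rng.uniform(3.0, 60.0, n),          # float64 by default
    "borough": rng.choice(["Manhattan", "Brooklyn", "Queens", "Bronx"], n),
})
before = df.memory_usage(deep=True).sum()

# Optimization 1: downcast numerics -- zone ids fit in int8, and fares
# lose nothing meaningful in float32.
df["pickup_zone_id"] = pd.to_numeric(df["pickup_zone_id"], downcast="integer")
df["fare_amount"] = pd.to_numeric(df["fare_amount"], downcast="float")

# Optimization 2: categorical dtype for the low-cardinality borough
# column -- four labels stored once plus small integer codes per row,
# instead of one Python string object per row.
df["borough"] = df["borough"].astype("category")

after = df.memory_usage(deep=True).sum()
print(f"{before:,} -> {after:,} bytes")
```

The byte counts printed will vary by pandas version, but the post-conversion footprint is a fraction of the original, which also speeds up the groupby passes in Tasks 1 and 2.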

