PracHub

Quick Overview

This question evaluates a candidate's ability to clean and aggregate large-scale time-series data in Pandas, emphasizing timezone normalization, deterministic deduplication, DST-aware alignment, event-sequence aggregation, rolling statistical features, anomaly detection, and memory- and performance-conscious ETL techniques.


Clean and aggregate factory event data in Pandas

Company: Roblox

Role: Data Scientist

Category: Data Manipulation (SQL/Python)

Difficulty: Medium

Interview Round: Take-home Project

You are given three Pandas DataFrames for a factory:

  (1) events[event_id, machine_id, ts_utc (datetime64[ns, UTC]), event_type in {'start','stop','fault'}, batch_id]
  (2) telemetry[machine_id, ts_local (datetime64[ns]), temperature_C, rpm, power_kW, timezone (IANA string like 'US/Pacific')]
  (3) calendar[date (YYYY-MM-DD), is_holiday (bool), shift in {'A','B','C'}]

Data issues: late-arriving events (up to 48 hours late), duplicate events (same event_id with ts_utc differing by up to ±2 seconds), daylight saving transitions, and missing telemetry rows.

Constraints: memory budget is 1 GB; total rows ≈ 50M.

Tasks:

  a) Normalize all time to a single axis; deduplicate events with a deterministic rule (state your rule) while preserving correct event order.
  b) For the last 7 calendar days up to and including today=2025-09-01 in each machine's local time, compute per-machine hourly features: throughput (count of completed start→stop cycles), 95th percentile temperature, and a rolling 24-hour z-score of power_kW. Handle missing hours and DST gaps/overlaps correctly.
  c) Join features into a tidy machine-hour panel indexed by [machine_id, hour_start_utc]; impute missing values robustly; flag anomalies where |z|>3.

Provide Pandas code snippets, explain performance tactics (chunked IO, dtypes, categoricals, Parquet, vectorized ops), and describe how you would test correctness on edge cases.
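One possible starting point for task (a) is sketched below. It is not an official answer. The function names `telemetry_to_utc` and `dedupe_events` are hypothetical, and the rules they encode are assumptions: ambiguous or nonexistent local times around DST transitions become NaT so they can be reviewed rather than silently guessed, and for each event_id the row with the earliest ts_utc is kept. The sketch also assumes the telemetry index is unique and that `timezone` holds plain strings.

```python
import pandas as pd


def telemetry_to_utc(telemetry: pd.DataFrame) -> pd.Series:
    """Convert naive ts_local to UTC, localizing one IANA zone at a time.

    Assumed policy: DST-ambiguous and nonexistent local times become NaT.
    """
    parts = []
    for tz, grp in telemetry.groupby("timezone", sort=False):
        aware = grp["ts_local"].dt.tz_localize(
            tz, ambiguous="NaT", nonexistent="NaT"
        )
        parts.append(aware.dt.tz_convert("UTC"))
    # Restore the original row order; rows with a missing zone end up as NaT.
    return pd.concat(parts).reindex(telemetry.index).rename("ts_utc")


def dedupe_events(events: pd.DataFrame) -> pd.DataFrame:
    """Assumed rule: per event_id keep the earliest ts_utc (ties: event_type)."""
    ordered = events.sort_values(["event_id", "ts_utc", "event_type"])
    unique = ordered.drop_duplicates(subset="event_id", keep="first")
    # Order by event time, not arrival order, so late arrivals land correctly.
    return unique.sort_values(
        ["machine_id", "ts_utc", "event_id"]
    ).reset_index(drop=True)
```

Mapping DST edge cases to NaT is the conservative choice; `ambiguous="infer"` (which needs time-sorted input) or `nonexistent="shift_forward"` are alternatives if dropping those readings is not acceptable.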

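The hourly features and anomaly flag from tasks (b) and (c) could be approached along the following lines. This is a minimal sketch under stated assumptions, not a reference solution: the function names, the output column names (`temp_p95`, `power_mean`, `power_z`, `is_anomaly`), and the `min_periods=12` threshold are all hypothetical. It assumes telemetry already carries a tz-aware `ts_utc` column, treats a cycle as completed only when a stop directly follows a start on the same machine (so an intervening fault breaks the cycle), and computes the z-score on hourly mean power with the current hour included in the 24-hour window.

```python
import pandas as pd


def hourly_throughput(events: pd.DataFrame) -> pd.Series:
    """Count stops whose previous event on the same machine is a start."""
    ev = events.sort_values(["machine_id", "ts_utc"])
    prev_type = ev.groupby("machine_id")["event_type"].shift()
    done = ev[(ev["event_type"] == "stop") & (prev_type == "start")]
    hour = done["ts_utc"].dt.floor("h").rename("hour_start_utc")
    return done.groupby(["machine_id", hour]).size().rename("throughput")


def hourly_panel(telemetry: pd.DataFrame) -> pd.DataFrame:
    """Aggregate to machine-hours on a complete UTC hour grid (gaps stay NaN)."""
    hour = telemetry["ts_utc"].dt.floor("h").rename("hour_start_utc")
    grouped = telemetry.groupby(["machine_id", hour])
    feats = pd.DataFrame({
        "temp_p95": grouped["temperature_C"].quantile(0.95),
        "power_mean": grouped["power_kW"].mean(),
    })
    hours = pd.date_range(hour.min(), hour.max(), freq="h")
    full = pd.MultiIndex.from_product(
        [telemetry["machine_id"].unique(), hours],
        names=["machine_id", "hour_start_utc"],
    )
    return feats.reindex(full).sort_index()


def add_power_zscore(panel: pd.DataFrame, window: str = "24h",
                     min_periods: int = 12) -> pd.DataFrame:
    """Add a rolling z-score of power_mean per machine and flag |z| > 3."""
    flat = panel.reset_index()
    flat["power_z"] = float("nan")
    for _, grp in flat.groupby("machine_id"):
        s = grp.set_index("hour_start_utc")["power_mean"]
        roll = s.rolling(window, min_periods=min_periods)
        std = roll.std()
        # Guard against zero variance so the z-score is NaN instead of inf.
        z = (s - roll.mean()) / std.where(std > 0)
        flat.loc[grp.index, "power_z"] = z.to_numpy()
    flat["is_anomaly"] = flat["power_z"].abs() > 3
    return flat.set_index(["machine_id", "hour_start_utc"])
```

Because the grid is built in UTC, DST gaps and overlaps in local time do not create missing or doubled hours here; missing telemetry shows up as NaN rows, which are left unimputed in this sketch so that the imputation rule can be chosen and tested separately.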

Last updated: Mar 29, 2026


Related Coding Questions

  • Write SQL for influence score and follower growth - Roblox (Easy)
  • Match requests and accepts into friendships in SQL - Roblox (Medium)
  • Implement deduped CTR/RPM aggregator over event stream - Roblox (Medium)
  • Compute CTR, RPM, and daily RPM variability in SQL - Roblox (Medium)
  • Write SQL for ads metrics and variability - Roblox (Medium)