Clean and aggregate factory event data in Pandas

Q: What difficulty level is this coding question?

This is a medium-difficulty Data Manipulation (SQL/Python) question, commonly asked during Take-home Project rounds at Roblox.

Q: What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Roblox during technical interviews.

Question

You are given three Pandas DataFrames for a factory:

(1) events[event_id, machine_id, ts_utc (datetime64[ns, UTC]), event_type in {'start','stop','fault'}, batch_id]
(2) telemetry[machine_id, ts_local (datetime64[ns]), temperature_C, rpm, power_kW, timezone (IANA string like 'US/Pacific')]
(3) calendar[date (YYYY-MM-DD), is_holiday (bool), shift in {'A','B','C'}]

Data issues: late-arriving events (up to 48 hours late), duplicate events (same event_id with ts_utc differences of up to ±2 seconds), daylight saving transitions, and missing telemetry rows. The memory budget is 1 GB and the total row count is ≈50M.

Tasks:

a) Normalize all time to a single axis; deduplicate events with a deterministic rule (state your rule) while preserving correct event order.

b) For the last 7 calendar days up to and including today=2025-09-01 in each machine's local time, compute per-machine hourly features: throughput (count of completed start→stop cycles), 95th percentile temperature, and a rolling 24-hour z-score of power_kW. Handle missing hours and DST gaps/overlaps correctly.

c) Join features into a tidy machine-hour panel indexed by [machine_id, hour_start_utc]; impute missing values robustly; flag anomalies where |z|>3.

Provide Pandas code snippets and explain performance tactics (chunked IO, dtypes, categoricals, Parquet, vectorized ops) and how you would test correctness on edge cases.
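As a starting point for task a), the sketch below shows one possible way to put telemetry on the UTC axis and to deduplicate events. It is not the accepted answer, only an illustration under stated assumptions: the helper names telemetry_to_utc and dedupe_events are hypothetical, the dedup rule (keep the earliest ts_utc per event_id) is one deterministic choice among several, and ambiguous local times in a DST overlap are marked NaT for later resolution rather than guessed.

```python
import pandas as pd


def telemetry_to_utc(telemetry: pd.DataFrame) -> pd.DataFrame:
    """Add a tz-aware ts_utc column, localizing one timezone group at a time."""
    parts = []
    for tz, grp in telemetry.groupby("timezone", sort=False, observed=True):
        local = grp["ts_local"].dt.tz_localize(
            tz,
            ambiguous="NaT",              # assumption: DST-overlap rows are set aside
            nonexistent="shift_forward",  # assumption: DST-gap rows move to the next valid instant
        )
        parts.append(grp.assign(ts_utc=local.dt.tz_convert("UTC")))
    # Restore the original row order after the per-timezone split.
    return pd.concat(parts).sort_index()


def dedupe_events(events: pd.DataFrame) -> pd.DataFrame:
    """Assumed rule: for each event_id keep the row with the earliest ts_utc."""
    ordered = events.sort_values(["event_id", "ts_utc"], kind="stable")
    kept = ordered.drop_duplicates("event_id", keep="first")
    # Final ordering per machine; event_id breaks timestamp ties deterministically.
    return kept.sort_values(["machine_id", "ts_utc", "event_id"]).reset_index(drop=True)
```

Grouping by timezone keeps the localization vectorized within each group instead of converting row by row. Because duplicates share an event_id, the rule does not need the ±2 second tolerance itself; that tolerance only matters if duplicates could arrive under different ids, which this sketch does not handle.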

PracHub · Accepted Answer

This question evaluates a candidate's ability to clean and aggregate large-scale time-series data in Pandas, emphasizing timezone normalization, deterministic deduplication, DST-aware alignment, event-sequence aggregation, rolling statistical features, anomaly detection, and memory- and performance-conscious ETL techniques.
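To make the rolling-feature and anomaly-detection parts concrete, here is a minimal sketch of an hourly panel built from telemetry that already carries a tz-aware ts_utc column. The function name hourly_power_features, the output column names, the choice to z-score the hourly mean of power_kW against the preceding 24 hours (excluding the current hour), and min_periods=3 are all assumptions made for illustration. Throughput and imputation are left out.

```python
import numpy as np
import pandas as pd


def hourly_power_features(tel: pd.DataFrame) -> pd.DataFrame:
    """Expects columns machine_id, ts_utc (tz-aware UTC), temperature_C, power_kW."""
    tel = tel.assign(hour_start_utc=tel["ts_utc"].dt.floor("h"))
    hourly = tel.groupby(["machine_id", "hour_start_utc"], observed=True).agg(
        temp_p95=("temperature_C", lambda s: s.quantile(0.95)),
        power_mean=("power_kW", "mean"),
    )
    out = []
    for machine, grp in hourly.groupby(level="machine_id"):
        g = grp.droplevel("machine_id")
        # Build a gap-free hourly UTC grid so missing hours become explicit NaN rows.
        full = pd.date_range(g.index.min(), g.index.max(), freq="h")
        g = g.reindex(full).rename_axis("hour_start_utc")
        # Window covers the 24 hours before each row (closed="left" excludes the row itself).
        roll = g["power_mean"].rolling("24h", min_periods=3, closed="left")
        std = roll.std().replace(0.0, np.nan)  # avoid division by zero on flat windows
        g["power_z"] = (g["power_mean"] - roll.mean()) / std
        g["is_anomaly"] = g["power_z"].abs() > 3  # NaN z-scores compare as False
        g["machine_id"] = machine
        out.append(g.reset_index())
    panel = pd.concat(out, ignore_index=True)
    return panel.set_index(["machine_id", "hour_start_utc"]).sort_index()
```

Working on the UTC grid sidesteps DST gaps and overlaps in the rolling window, since every UTC hour exists exactly once; local-time filters such as "last 7 calendar days" can be applied afterwards by converting hour_start_utc back to each machine's timezone.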

Quick Overview
