You are given three Pandas DataFrames for a factory:
(1) events[event_id, machine_id, ts_utc (datetime64[ns, UTC]), event_type in {'start','stop','fault'}, batch_id]
(2) telemetry[machine_id, ts_local (datetime64[ns]), temperature_C, rpm, power_kW, timezone (IANA string such as 'US/Pacific')]
(3) calendar[date (YYYY-MM-DD), is_holiday (bool), shift in {'A','B','C'}]

Data issues: events arrive up to 48 hours late; duplicate events exist (same event_id with ts_utc values differing by up to ±2 seconds); local timestamps cross daylight-saving transitions; telemetry rows are missing. Memory budget is 1 GB; total rows ≈ 50M.

Tasks:
a) Normalize all timestamps to a single time axis; deduplicate events with a deterministic rule (state the rule) while preserving correct event order.
b) For the last 7 calendar days up to and including today = 2025-09-01 in each machine's local time, compute per-machine hourly features: throughput (count of completed start→stop cycles), 95th-percentile temperature, and a rolling 24-hour z-score of power_kW. Handle missing hours and DST gaps/overlaps correctly.
c) Join the features into a tidy machine-hour panel indexed by [machine_id, hour_start_utc]; impute missing values robustly; flag anomalies where |z| > 3.

Provide Pandas code snippets, explain performance tactics (chunked IO, dtypes, categoricals, Parquet, vectorized operations), and describe how you would test correctness on edge cases.
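For task (a), one deterministic deduplication rule is: among rows sharing an event_id, keep the earliest ts_utc, with ties broken by stable sort order; then re-sort by (machine_id, ts_utc, event_id) so event order is reproducible. A minimal sketch on a hypothetical three-row frame (the column layout follows the spec; the values are illustrative):

```python
import pandas as pd

# Hypothetical events frame: event_id 1 appears twice with ts_utc
# differing by 2 seconds, as described in the data issues.
events = pd.DataFrame({
    "event_id": [1, 1, 2],
    "machine_id": ["m1", "m1", "m1"],
    "ts_utc": pd.to_datetime(
        ["2025-09-01 00:00:01", "2025-09-01 00:00:03", "2025-09-01 00:00:02"],
        utc=True),
    "event_type": pd.Categorical(["start", "start", "stop"]),
    "batch_id": ["b1", "b1", "b1"],
})

# Deterministic rule: keep the earliest ts_utc per event_id.
# mergesort is stable, so ties resolve identically on every run.
deduped = (events
           .sort_values(["event_id", "ts_utc"], kind="mergesort")
           .drop_duplicates("event_id", keep="first")
           .sort_values(["machine_id", "ts_utc", "event_id"])
           .reset_index(drop=True))
```

Keeping the earliest timestamp is a convention, not a requirement; "keep latest" would be equally deterministic as long as the rule is stated and the final sort is stable.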
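Still under task (a), telemetry must be lifted from per-row local time onto the UTC axis. Grouping by the timezone column lets `tz_localize` run once per zone, and its `ambiguous`/`nonexistent` arguments handle the fall-back overlap and spring-forward gap explicitly. A sketch on a hypothetical two-row slice straddling the 2025-03-09 US/Pacific spring-forward transition:

```python
import pandas as pd

# Hypothetical telemetry slice; 'timezone' is an IANA name per row.
telemetry = pd.DataFrame({
    "machine_id": ["m1", "m1"],
    "ts_local": pd.to_datetime(["2025-03-09 01:30:00",   # PST, before the gap
                                "2025-03-09 03:30:00"]), # PDT, after the gap
    "temperature_C": [60.0, 61.0],
    "timezone": ["US/Pacific", "US/Pacific"],
})

# Localize once per timezone: nonexistent local times (spring-forward
# gap) shift forward; ambiguous times (fall-back overlap) become NaT
# so they can be audited rather than silently guessed.
parts = []
for tz, g in telemetry.groupby("timezone"):
    parts.append(g["ts_local"]
                 .dt.tz_localize(tz, ambiguous="NaT",
                                 nonexistent="shift_forward")
                 .dt.tz_convert("UTC"))
telemetry["ts_utc"] = pd.concat(parts).sort_index()
```

Marking ambiguous times as NaT is a deliberately conservative choice; `ambiguous="infer"` can resolve monotone sequences but fails on unordered data.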
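For the throughput feature in task (b), a completed cycle can be defined as a stop event matched to the most recent unmatched start on the same machine, counted in the hour the stop occurs. The helper name `hourly_throughput` below is hypothetical; the matching loop is shown explicitly for clarity, though per-machine it could be vectorized with cumulative counts of starts and stops:

```python
import pandas as pd

# Hypothetical deduplicated event stream for one machine, already in UTC.
ev = pd.DataFrame({
    "machine_id": "m1",
    "ts_utc": pd.to_datetime(
        ["2025-09-01 00:05", "2025-09-01 00:50",
         "2025-09-01 01:10", "2025-09-01 02:20"], utc=True),
    "event_type": ["start", "stop", "start", "stop"],
})

def hourly_throughput(g: pd.DataFrame) -> pd.Series:
    """Count completed start->stop cycles per hour for one machine."""
    g = g.sort_values("ts_utc")
    open_starts = 0
    completed_hours = []
    for etype, ts in zip(g["event_type"], g["ts_utc"]):
        if etype == "start":
            open_starts += 1
        elif etype == "stop" and open_starts > 0:
            # A stop closes the most recent unmatched start; unmatched
            # stops (e.g. after a fault) are ignored.
            open_starts -= 1
            completed_hours.append(ts.floor("h"))
    return (pd.Series(completed_hours, dtype="datetime64[ns, UTC]")
            .value_counts().sort_index())

tp = hourly_throughput(ev)
```

The two cycles here land in the 00:00 and 02:00 hours; the 01:00 hour has an open start but no completion, so it correctly contributes zero.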
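The remaining task (b) features compose from a floor-to-hour groupby plus a grouped rolling window. A sketch on synthetic hourly data (the 50M-row real data would be aggregated to machine-hours first; the seed and distributions are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical raw telemetry: 7 days x 24 hours, 4 readings per hour.
rng = np.random.default_rng(0)
hours = pd.date_range("2025-08-26", "2025-09-01 23:00", freq="h", tz="UTC")
telem = pd.DataFrame({
    "machine_id": "m1",
    "ts_utc": np.repeat(hours, 4),
    "temperature_C": rng.normal(70, 2, size=len(hours) * 4),
    "power_kW": rng.normal(5, 1, size=len(hours) * 4),
})

# 95th-percentile temperature and mean power per machine-hour.
hour = telem["ts_utc"].dt.floor("h")
feat = (telem.groupby(["machine_id", hour])
        .agg(p95_temp=("temperature_C", lambda s: s.quantile(0.95)),
             power_kW=("power_kW", "mean"))
        .rename_axis(["machine_id", "hour_start_utc"]))

# Rolling 24-hour z-score of hourly power, computed per machine so
# windows never cross machine boundaries; min_periods tolerates the
# warm-up and isolated missing hours.
roll = feat.groupby(level="machine_id")["power_kW"].rolling(24, min_periods=12)
mu = roll.mean().reset_index(level=0, drop=True)
sd = roll.std().reset_index(level=0, drop=True)
feat["power_z"] = (feat["power_kW"] - mu) / sd
```

Because the grid is built on UTC hours, DST gaps and overlaps in local time cannot duplicate or drop window slots; the local calendar only determines which UTC hours fall inside the 7-day reporting range.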
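For task (c), reindexing onto the complete machine-hour grid makes missing hours explicit before imputation, and a per-machine median is a robust fill (insensitive to outliers, unlike the mean). A sketch on a hypothetical sparse feature table with one missing hour:

```python
import numpy as np
import pandas as pd

# Hypothetical sparse feature table: the 01:00 hour is absent entirely
# and the 02:00 row has a missing temperature.
feat = pd.DataFrame({
    "machine_id": ["m1", "m1", "m1"],
    "hour_start_utc": pd.to_datetime(
        ["2025-09-01 00:00", "2025-09-01 02:00", "2025-09-01 03:00"], utc=True),
    "p95_temp": [70.0, np.nan, 72.0],
    "power_z": [0.5, 3.4, -0.2],
})

# Build the full machine-hour grid so every missing hour gets a row.
full_hours = pd.date_range(feat["hour_start_utc"].min(),
                           feat["hour_start_utc"].max(), freq="h", tz="UTC")
grid = pd.MultiIndex.from_product(
    [feat["machine_id"].unique(), full_hours],
    names=["machine_id", "hour_start_utc"])
panel = feat.set_index(["machine_id", "hour_start_utc"]).reindex(grid)

# Robust imputation: per-machine median; anomaly flag on |z| > 3
# (NaN z-scores compare False, so imputed hours are never flagged).
panel["p95_temp"] = (panel.groupby(level="machine_id")["p95_temp"]
                     .transform(lambda s: s.fillna(s.median())))
panel["anomaly"] = panel["power_z"].abs() > 3
```

The resulting panel is tidy: one row per (machine_id, hour_start_utc), one column per feature, ready for downstream joins against the calendar table.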
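On the performance side, the main levers within a 1 GB budget are reading only needed Parquet columns (and row groups, via pyarrow), processing in chunks, downcasting numerics, and converting low-cardinality strings to categoricals before any join. The helper name `shrink` and the 0.5 cardinality threshold below are illustrative choices, not fixed rules:

```python
import pandas as pd

def shrink(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast float64 columns and categorize low-cardinality strings.

    Illustrative sketch; in production this would run per chunk, e.g.
    inside a pd.read_csv(..., chunksize=...) loop or over Parquet row
    groups, so peak memory stays bounded.
    """
    df = df.copy()
    for col in df.select_dtypes("float64"):
        df[col] = pd.to_numeric(df[col], downcast="float")
    for col in df.select_dtypes("object"):
        # Categorize only when repetition makes the dictionary pay off.
        if df[col].nunique() / max(len(df), 1) < 0.5:
            df[col] = df[col].astype("category")
    return df

demo = pd.DataFrame({"machine_id": ["m1", "m2"] * 1000,
                     "power_kW": [5.0, 6.0] * 1000})
small = shrink(demo)
```

With machine_id repeated across millions of rows, the categorical dictionary typically cuts that column's footprint by an order of magnitude, and categorical join keys also speed up merges.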