You are given two pandas DataFrames.

events columns: user_id:int; ts:str, ISO-8601 with timezone offset (e.g., '2025-08-31T23:58:43-07:00'); event:str in {'signup','login','purchase'}; revenue:str, may include currency symbols (e.g., '$12.34', '€9,50') or be null; device_id:str; session_id:str (may be duplicated); source:str (may be null).

users columns: user_id:int; signup_ts:str (UTC); tz:str, IANA timezone (e.g., 'America/Los_Angeles'); is_bot:bool; country:str; plan:str in {'free','pro','enterprise'}.

Tasks:

(1) Clean and normalize: make all timestamps timezone-aware; deduplicate events, where duplicates are rows sharing the same (user_id, event, session_id) within a 5-minute window, keeping the earliest ts; drop events whose user_id does not appear in users; explain how you would do this without chained assignment.

(2) Define an "active day" as any local day (in the user's tz) with at least one non-'signup' event. Compute DAU per local date for 2025-08-25 through 2025-09-01 inclusive, excluding users with is_bot=True and excluding device_ids that appear on ≥20 distinct user_id values (shared devices). Return a DataFrame with columns local_date, dau.

(3) Compute 7-day retention for cohorts with signup_date (UTC) between 2025-08-24 and 2025-08-31. A user is "retained" if they have any non-'signup' event on the 7th day after signup in their local tz. Return columns: signup_date, cohort_size, retained, retention_rate, plus a Wilson 95% confidence interval for the rate.

(4) Parse the revenue strings and convert to USD using an FX mapping fx = {'USD':1.0,'EUR':1.08,'JPY':0.0068,...} keyed by currency inferred from the symbols; impute missing revenue for 'purchase' rows using the median revenue over the 30 days ending 2025-09-01 within (country, plan) groups; then compute ARPU for the 7 days ending 2025-09-01 for non-bot users.
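The cleaning step in task (1) can be sketched as follows. This is a minimal illustration on tiny hypothetical sample frames (the data values below are invented, not from the spec); the 5-minute dedup is implemented as a simple gap-from-previous-row rule within each (user_id, event, session_id) group, which is one common interpretation of the window. Chained assignment is avoided by using `assign`, boolean-mask filtering, and `merge` rather than writing into a view.

```python
import pandas as pd

# Hypothetical sample data matching the schema described above.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "ts": ["2025-08-31T23:58:43-07:00", "2025-09-01T00:01:10-07:00",
           "2025-08-30T10:00:00+00:00", "2025-08-30T11:00:00+00:00"],
    "event": ["login", "login", "purchase", "login"],
    "revenue": [None, None, "$12.34", None],
    "device_id": ["d1", "d1", "d2", "d3"],
    "session_id": ["s1", "s1", "s2", "s3"],
    "source": [None, "ads", None, None],
})
users = pd.DataFrame({
    "user_id": [1, 2],
    "signup_ts": ["2025-08-20T00:00:00Z", "2025-08-21T00:00:00Z"],
    "tz": ["America/Los_Angeles", "Europe/Berlin"],
    "is_bot": [False, False],
    "country": ["US", "DE"],
    "plan": ["free", "pro"],
})

# Timezone-aware timestamps: parse mixed offsets and normalize to UTC.
# assign() returns a new frame, so no chained assignment is involved.
events = events.assign(ts=pd.to_datetime(events["ts"], utc=True))
users = users.assign(signup_ts=pd.to_datetime(users["signup_ts"], utc=True))

# Deduplicate: within each (user_id, event, session_id) group, keep a row
# only if it is the group's first row or >5 minutes after the previous one.
events = events.sort_values(["user_id", "event", "session_id", "ts"])
gap = events.groupby(["user_id", "event", "session_id"])["ts"].diff()
events = events[gap.isna() | (gap > pd.Timedelta(minutes=5))]

# Drop events whose user_id is absent from users: an inner merge keeps
# only matching rows, again without any in-place mutation of a view.
events = events.merge(users[["user_id"]], on="user_id", how="inner")
```

In the sample above, the second login for user 1 falls 2 minutes 27 seconds after the first in the same session and is dropped, and user 3's event is removed because user 3 is not in users.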
(5) Provide vectorized pandas code sketches (no explicit Python loops over rows), discuss the expected memory footprint and computational complexity at 100 million events, and outline a chunked processing strategy that keeps peak RAM under 8 GB (e.g., category dtypes, a read_csv dtype map, sorted merges, and groupby with observed=True).
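The chunked strategy in task (5) can be sketched as a map-reduce over `read_csv` chunks: each chunk is aggregated independently (vectorized within the chunk; the only Python loop is over chunks, not rows), and the small partial results are combined at the end. This is a hedged sketch under stated assumptions: the `io.StringIO` source, the toy column subset, and the names `dtypes`, `partials`, and `counts` are illustrative stand-ins for a real on-disk file and the actual aggregation needed.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for the 100M-row events file.
csv = io.StringIO(
    "user_id,event,device_id\n"
    "1,login,d1\n"
    "1,purchase,d1\n"
    "2,login,d2\n"
    "2,login,d2\n"
)

# dtype map applied at parse time: narrow ints and category columns keep
# each chunk small (a category stores one int8/int16 code per row plus a
# shared dictionary, instead of a Python string object per row).
dtypes = {"user_id": "int32", "event": "category", "device_id": "category"}

partials = []
for chunk in pd.read_csv(csv, dtype=dtypes, chunksize=2):
    # Aggregate per chunk; observed=True avoids materializing empty
    # category combinations, which matters with many categories.
    partials.append(chunk.groupby(["user_id", "event"], observed=True).size())

# Reduce: concatenate the small partial aggregates and re-aggregate, so
# peak memory is one chunk plus the running partials, never the full file.
counts = pd.concat(partials).groupby(level=["user_id", "event"], observed=True).sum()
```

The same pattern (per-chunk partial aggregate, then a second groupby over the concatenated partials) works for any algebraic aggregate such as counts, sums, and min/max; order statistics like medians need a different scheme (e.g., two passes or approximate sketches).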