Debug a Hive Query for DAU
Company: TikTok
Role: Data Engineer
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Technical Screen
You are given two Hive tables:

users(user_id BIGINT, created_at TIMESTAMP)
events(user_id BIGINT, event_time TIMESTAMP, event_name STRING) PARTITIONED BY (event_date STRING, formatted as 'YYYY-MM-DD')

A teammate wrote the following query, which is intended to return the site-wide Daily Active Users (DAU) for 2025-08-15:

SELECT u.user_id, COUNT(DISTINCT e.user_id) AS dau FROM users u LEFT JOIN events e ON u.user_id = e.user_id WHERE DATE(e.event_time) = '2025-08-15' GROUP BY u.user_id;

Tasks:
1. Identify at least three bugs or inefficiencies (e.g., join semantics, grouping grain, partition pruning, time handling).
2. Rewrite the query as a correct and efficient Hive-compatible query that outputs a single DAU number for that date, assuming that the relevant event_name values define activity.
3. Explain how you would validate correctness and performance (test cases, edge cases, and use of partitions/statistics).
Quick Answer: This question evaluates proficiency in Hive/SQL query formulation, data partitioning and pruning, join and aggregation semantics, timestamp handling, and performance tuning for large-scale analytics.
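
One possible way to approach the rewrite and its validation is sketched below. It is a minimal sketch under stated assumptions, not a definitive answer. SQLite (through Python's standard sqlite3 module) stands in for Hive so that both queries can be run on toy data, which means partition pruning itself is not exercised here. The event names 'app_open' and 'video_view' are hypothetical placeholders for whatever defines activity, and the rewritten query assumes event_date is derived from event_time in the reporting time zone.

```python
import sqlite3

# The teammate's query, unchanged. The WHERE filter on e.event_time discards the
# NULL-extended rows, so the LEFT JOIN behaves like an inner join, and
# GROUP BY u.user_id yields one row per user instead of one site-wide number.
BUGGY_SQL = """
SELECT u.user_id, COUNT(DISTINCT e.user_id) AS dau
FROM users u
LEFT JOIN events e ON u.user_id = e.user_id
WHERE DATE(e.event_time) = '2025-08-15'
GROUP BY u.user_id
"""

# Candidate rewrite: no join, no GROUP BY, and a filter on the partition column
# (event_date) so Hive can prune partitions. The event names are assumptions.
FIXED_SQL = """
SELECT COUNT(DISTINCT e.user_id) AS dau
FROM events e
WHERE e.event_date = '2025-08-15'
  AND e.event_name IN ('app_open', 'video_view')
"""


def run_dau_check():
    """Run both queries on toy data and return (buggy_rows, fixed_dau)."""
    conn = sqlite3.connect(":memory:")
    # In SQLite event_date is an ordinary column; in Hive it is the partition key.
    conn.executescript("""
        CREATE TABLE users (user_id INTEGER, created_at TEXT);
        CREATE TABLE events (user_id INTEGER, event_time TEXT,
                             event_name TEXT, event_date TEXT);
    """)
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [(uid, "2025-01-01 00:00:00") for uid in (1, 2, 3, 4)],
    )
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?, ?)",
        [
            (1, "2025-08-15 01:00:00", "app_open", "2025-08-15"),
            (1, "2025-08-15 09:30:00", "video_view", "2025-08-15"),  # same user twice
            (2, "2025-08-15 23:59:59", "video_view", "2025-08-15"),  # end-of-day edge
            (3, "2025-08-14 12:00:00", "app_open", "2025-08-14"),    # previous day
            (4, "2025-08-15 10:00:00", "push_received", "2025-08-15"),  # not activity
        ],
    )
    buggy_rows = sorted(conn.execute(BUGGY_SQL).fetchall())
    fixed_dau = conn.execute(FIXED_SQL).fetchone()[0]
    conn.close()
    return buggy_rows, fixed_dau


buggy_rows, fixed_dau = run_dau_check()
print("buggy:", buggy_rows)
print("fixed DAU:", fixed_dau)
```

On this toy data the original query returns one row per user, each with dau = 1, and it includes user 4, whose only event is not an activity event. The rewrite returns a single number covering users 1 and 2. In Hive itself, running EXPLAIN (or EXPLAIN DEPENDENCY) on the rewritten query is one way to check that only the event_date = '2025-08-15' partition is read.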