Compute violation rate and flag precision in SQL
Company: Google
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Onsite
You are analyzing a Trust & Safety product in BigQuery. Assume 'today' is 2025-09-01 (UTC). Define precise metrics and write SQL to compute them, being explicit about edge cases.
Schema (invented):
- users(user_id INT, country STRING, signup_time TIMESTAMP)
- events(user_id INT, event_time TIMESTAMP, event_type STRING) -- event_type in {'post','comment','message','login'}
- content(content_id INT, uploader_user_id INT, created_time TIMESTAMP, country STRING)
- moderation_decisions(content_id INT, decision STRING, decision_time TIMESTAMP) -- decision in {'violation','clean','needs_review'}; multiple rows per content over time possible
- auto_flags(content_id INT, flag_time TIMESTAMP, rule_id STRING)
Sample rows:
users
+----------+---------+---------------------+
| user_id | country | signup_time |
+----------+---------+---------------------+
| 1 | US | 2025-01-05 10:00:00 |
| 2 | IN | 2025-07-12 08:00:00 |
| 3 | US | 2025-08-20 12:00:00 |
| 4 | BR | 2025-06-01 09:30:00 |
+----------+---------+---------------------+
events
+----------+---------------------+-------------+
| user_id | event_time | event_type |
+----------+---------------------+-------------+
| 1 | 2025-08-29 02:00:00 | post |
| 1 | 2025-08-30 03:00:00 | login |
| 2 | 2025-08-28 11:00:00 | comment |
| 3 | 2025-08-27 16:00:00 | message |
| 4 | 2025-07-15 18:00:00 | post |
+----------+---------------------+-------------+
content
+------------+-------------------+---------------------+---------+
| content_id | uploader_user_id | created_time | country |
+------------+-------------------+---------------------+---------+
| 100 | 1 | 2025-08-28 05:00:00 | US |
| 101 | 2 | 2025-08-26 10:00:00 | IN |
| 102 | 3 | 2025-08-31 01:00:00 | US |
| 103 | 4 | 2025-07-01 06:00:00 | BR |
+------------+-------------------+---------------------+---------+
moderation_decisions
+------------+-----------+---------------------+
| content_id | decision | decision_time |
+------------+-----------+---------------------+
| 100 | violation | 2025-08-31 07:00:00 |
| 100 | clean | 2025-09-02 01:00:00 |
| 101 | clean | 2025-08-27 12:00:00 |
| 102 | violation | 2025-09-01 10:00:00 |
| 103 | violation | 2025-08-10 09:00:00 |
+------------+-----------+---------------------+
auto_flags
+------------+---------------------+----------+
| content_id | flag_time | rule_id |
+------------+---------------------+----------+
| 100 | 2025-08-31 06:30:00 | R1 |
| 101 | 2025-08-26 10:30:00 | R2 |
| 102 | 2025-09-01 09:00:00 | R1 |
| 102 | 2025-09-01 09:05:00 | R1 |
+------------+---------------------+----------+
Task A — 7-day Confirmed Violation Rate per 1,000 Active Users by country:
- Window: [2025-08-26 00:00:00, 2025-09-02 00:00:00) (the last 7 full days ending at 'today' inclusive via half-open interval).
- Active users: distinct users with ≥1 events where event_type IN ('post','comment','message') within the window.
- Country attribution: use users.country for the uploader_user_id; if NULL, fallback to content.country.
- Violations counted: distinct content_id with at least one decision='violation' whose decision_time falls within the window AND whose content.created_time < window_end (exclude content created after the window). If multiple decisions for the same content exist within the window, use the latest within-window decision only; if both 'clean' and 'violation' occur within the window, treat as violation if any within-window decision is 'violation'. Ignore decisions outside the window when determining within-window status.
- Output columns: country, window_start, window_end, active_users, violations, violation_rate_per_1000 (round to 2 decimals).
- Write a single BigQuery-standard SQL query to produce this.
Task B — 28-day Auto-Flag Precision by rule_id:
- Flag cohort: auto_flags where flag_time in [2025-08-05 00:00:00, 2025-09-02 00:00:00).
- Deduping: if multiple flags exist for the same content_id and rule_id within the cohort, keep only the earliest flag for that (content_id, rule_id) pair.
- Match window: a flag is a true positive if a moderation_decisions row with decision='violation' occurs within (flag_time, flag_time + 48 hours]. If multiple decisions occur in that 48h window, use the latest; 'needs_review' and 'clean' do not count as violations. Decisions after 48h do not count for precision.
- Output columns: rule_id, flags, matched_violations, precision = matched_violations / flags.
- Write a single BigQuery-standard SQL query to produce this.
Edge cases to handle explicitly in your SQL: time zones (assume all timestamps UTC), late-arriving decisions (outside windows are ignored), users with no country (coalesce with content.country), and zero-division (precision should be 0.0 if flags=0).
Quick Answer: This question evaluates proficiency in analytical SQL and data engineering concepts—specifically time-windowed metric computation, user activity definitions, join and attribution logic, deduplication, and handling temporal moderation and auto-flag events in BigQuery.