Compute Moderation View Metrics
Company: Airwallex
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: medium
Interview Round: Onsite
You are working on a content moderation analytics problem with two tables.
Table 1: `content_views`
- `user_id` INTEGER: user who viewed the post
- `post_id` INTEGER: viewed post ID
- `view_count` INTEGER: number of views generated by that user for that post on that date
- `view_date` STRING: date of the views in `YYYY-MM-DD` format
- Primary key: (`user_id`, `post_id`, `view_date`)
Table 2: `post_policy_scores`
- `post_id` INTEGER: post ID
- `violation_type` STRING: policy category such as `Spam`, `Scam`, `Nudity`, or `Harassment`
- `probability_violating` DOUBLE: model-estimated probability that the post violates that policy
- Primary key: (`post_id`, `violation_type`)
Relationship:
- `content_views.post_id` joins to `post_policy_scores.post_id`
Assumptions:
- Treat `view_date` as a UTC date string and cast it to `DATE` when needed.
- Let `analysis_date` be the maximum `view_date` present in `content_views`.
- The last 7 days and last 30 days are inclusive windows ending on `analysis_date`.
- For classification-based questions, define a post as belonging to a violation type when `probability_violating > 0.5`.
- For monthly aggregation, use the calendar month derived from `view_date`.
Write SQL for the following tasks:
1. Count how many distinct posts received at least 10 total views in the last 7 days.
- The 10-view filter must be applied after grouping by `post_id`.
- Required output column: `num_posts`
2. Compute Nudity view prevalence in the last 30 days, defined as:
`total views on posts classified as Nudity / total views on all posts in that window`
- Required output columns: `nudity_views`, `total_views`, `nudity_view_prevalence`
3. Compute the monthly violation-type distribution.
- For each calendar month and each `violation_type`, among post-violation pairs where `probability_violating > 0.5` and the post received at least one view in that month, report:
- `month`
- `violation_type`
- `flagged_post_count`: number of distinct posts in that month assigned to that violation type
- `distribution_pct`: `flagged_post_count / total flagged post-violation pairs in that month`
Quick Answer: This question evaluates proficiency in data manipulation and analytical querying, covering time-windowed aggregations, joins between view and policy-score tables, classification-based filtering, date casting, and calculation of prevalence and distribution metrics within the Data Manipulation (SQL/Python) domain.