##### Question You are working on a content moderation analytics problem. You have two tables. **Table 1: `content_views`** (one row per user / post / day) - `user_id` INTEGER: user who viewed the post - `post_id` INTEGER: viewed post ID - `view_count` INTEGER: number of views that user generated for that post on that date - `view_date` STRING: date of the views in `YYYY-MM-DD` format - Primary key: (`user_id`, `post_id`, `view_date`) **Table 2: `post_policy_scores`** (one row per post / violation type) - `post_id` INTEGER: post ID - `violation_type` STRING: policy category such as `Spam`, `Scam`, `Nudity`, or `Harassment` - `probability_violating` DOUBLE: model-estimated probability that the post violates that policy - Primary key: (`post_id`, `violation_type`) **Relationship:** one post can have multiple rows in `post_policy_scores`, one per possible violation type. `content_views.post_id` joins to `post_policy_scores.post_id`. **Assumptions** - Treat `view_date` as a UTC date string and cast it to `DATE` when needed. - Let `analysis_date` be the maximum `view_date` present in `content_views`. - The last 7 days and last 30 days are inclusive windows ending on `analysis_date`. - For classification-based questions, treat a post as belonging to a violation type when `probability_violating > 0.5`. - For monthly aggregation, use the calendar month derived from `view_date`. Write SQL for the following tasks: 1. **Posts with at least 10 views (last 7 days).** Count how many distinct posts received at least 10 total views in the last 7 days. Aggregate `view_count` across all users and all dates in the window, and apply the 10-view filter *after* grouping by `post_id`. Required output column: `num_posts`. 2. **Nudity view prevalence (last 30 days).** Define a post as a `Nudity` post if it has a `post_policy_scores` row with `violation_type = 'Nudity'` and `probability_violating > 0.5`. Compute: `view prevalence = total views on Nudity posts in the last 30 days / total views on all posts in the last 30 days`. Required output columns: `nudity_views`, `total_views`, `nudity_view_prevalence`. 3. **Monthly violation-type distribution.** For each calendar month (derived from `view_date`) and each `violation_type`, among post/violation pairs where `probability_violating > 0.5` and the post received at least one view in that month, report the number of distinct posts assigned to that violation type and that type's share of all flagged post/violation pairs in the month. Required output columns: `month`, `violation_type`, `flagged_post_count`, `distribution_pct`.

An Airwallex Data Scientist onsite SQL question on content-moderation analytics. You write three queries over a per-day view table and a one-to-many post/violation-type policy-score table: distinct posts with 10+ views in the last 7 days, Nudity view prevalence over the last 30 days, and a monthly violation-type distribution. The core challenge is windowed aggregation plus avoiding view double-counting across violation types.

How do I approach Data Manipulation (SQL/Python) interview questions?

Data Manipulation (SQL/Python) questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master data manipulation (sql/python) interviews.

What difficulty level is this interview question?

This is a medium difficulty Data Manipulation (SQL/Python) question, commonly asked during Onsite rounds at Airwallex.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Airwallex during technical interviews.

Compute Moderation View Metrics | Airwallex Interview Question

Question

You are working on a content moderation analytics problem. You have two tables.

Table 1: content_views (one row per user / post / day)

user_id INTEGER: user who viewed the post
post_id INTEGER: viewed post ID
view_count INTEGER: number of views that user generated for that post on that date
view_date STRING: date of the views in YYYY-MM-DD format
Primary key: ( user_id , post_id , view_date )

Table 2: post_policy_scores (one row per post / violation type)

post_id INTEGER: post ID
violation_type STRING: policy category such as Spam , Scam , Nudity , or Harassment
probability_violating DOUBLE: model-estimated probability that the post violates that policy
Primary key: ( post_id , violation_type )

Relationship: one post can have multiple rows in post_policy_scores, one per possible violation type. content_views.post_id joins to post_policy_scores.post_id.

Assumptions

Treat view_date as a UTC date string and cast it to DATE when needed.
Let analysis_date be the maximum view_date present in content_views .
The last 7 days and last 30 days are inclusive windows ending on analysis_date .
For classification-based questions, treat a post as belonging to a violation type when probability_violating > 0.5 .
For monthly aggregation, use the calendar month derived from view_date .

Write SQL for the following tasks:

Posts with at least 10 views (last 7 days). Count how many distinct posts received at least 10 total views in the last 7 days. Aggregate view_count across all users and all dates in the window, and apply the 10-view filter after grouping by post_id . Required output column: num_posts .
Nudity view prevalence (last 30 days). Define a post as a Nudity post if it has a post_policy_scores row with violation_type = 'Nudity' and probability_violating > 0.5 . Compute: view prevalence = total views on Nudity posts in the last 30 days / total views on all posts in the last 30 days . Required output columns: nudity_views , total_views , nudity_view_prevalence .
Monthly violation-type distribution. For each calendar month (derived from view_date ) and each violation_type , among post/violation pairs where probability_violating > 0.5 and the post received at least one view in that month, report the number of distinct posts assigned to that violation type and that type's share of all flagged post/violation pairs in the month. Required output columns: month , violation_type , flagged_post_count , distribution_pct .

Question

You are working on a content moderation analytics problem. You have two tables.

Table 1: content_views (one row per user / post / day)

user_id INTEGER: user who viewed the post
post_id INTEGER: viewed post ID
view_count INTEGER: number of views that user generated for that post on that date
view_date STRING: date of the views in YYYY-MM-DD format
Primary key: ( user_id , post_id , view_date )

Table 2: post_policy_scores (one row per post / violation type)

post_id INTEGER: post ID
violation_type STRING: policy category such as Spam , Scam , Nudity , or Harassment
probability_violating DOUBLE: model-estimated probability that the post violates that policy
Primary key: ( post_id , violation_type )

Relationship: one post can have multiple rows in post_policy_scores, one per possible violation type. content_views.post_id joins to post_policy_scores.post_id.

Assumptions

Treat view_date as a UTC date string and cast it to DATE when needed.
Let analysis_date be the maximum view_date present in content_views .
The last 7 days and last 30 days are inclusive windows ending on analysis_date .
For classification-based questions, treat a post as belonging to a violation type when probability_violating > 0.5 .
For monthly aggregation, use the calendar month derived from view_date .

Write SQL for the following tasks:

Posts with at least 10 views (last 7 days). Count how many distinct posts received at least 10 total views in the last 7 days. Aggregate view_count across all users and all dates in the window, and apply the 10-view filter after grouping by post_id . Required output column: num_posts .
Nudity view prevalence (last 30 days). Define a post as a Nudity post if it has a post_policy_scores row with violation_type = 'Nudity' and probability_violating > 0.5 . Compute: view prevalence = total views on Nudity posts in the last 30 days / total views on all posts in the last 30 days . Required output columns: nudity_views , total_views , nudity_view_prevalence .
Monthly violation-type distribution. For each calendar month (derived from view_date ) and each violation_type , among post/violation pairs where probability_violating > 0.5 and the post received at least one view in that month, report the number of distinct posts assigned to that violation type and that type's share of all flagged post/violation pairs in the month. Required output columns: month , violation_type , flagged_post_count , distribution_pct .

Compute Moderation View Metrics

Quick Overview

Question

Solution

Submit Your Answer to Earn 20XP

Compute Moderation View Metrics

Quick Overview

Question

Solution

Submit Your Answer to Earn 20XP