SQL Joins, Aggregations, And Deduplication
Asked of: Data Scientist
Last updated

What's being tested
You’re being tested on whether you can turn messy event-level data into reliable metrics without overcounting. Strong answers show command of join cardinality, null behavior, aggregation grain, and deduplication logic before writing SQL.
Core knowledge
- Always identify the grain first: user-day, session, event, impression, order, or experiment unit.
- One-to-many joins can inflate aggregates; pre-aggregate or deduplicate before joining when needed.
COUNT(*),COUNT(column), andCOUNT(DISTINCT column)answer different questions.- Use
ROW_NUMBER()withPARTITION BYand deterministicORDER BYfor deduplication. LEFT JOINpreserves unmatched left rows; filtering right-table columns inWHEREcan turn it into an inner join.- Aggregate after joining only if the joined table is at the same grain or intentionally expands rows.
- For Meta-style event data, handle duplicate logs, late arrivals, bot/test traffic, and multiple devices per user.
Worked example
Deduplicating events before aggregating metrics
A strong candidate would first define the metric grain: for example, “daily active users” is one row per user_id, date, while “click-through rate” may be per impression or per user depending on the prompt. Before joining clicks to impressions, they would check whether clicks can have multiple rows per impression and whether impressions are uniquely keyed. They would likely create CTEs: one to deduplicate raw events using ROW_NUMBER(), one to aggregate to the desired grain, and then a final join/aggregation. They would also call out validation checks, such as comparing row counts before and after joins and confirming that denominators are not inflated.
A common pitfall
A tempting mistake is to join raw event tables directly and then aggregate, assuming SQL will “average out” duplicates. For example, joining users to sessions to clicks can multiply user rows and make COUNT(user_id) or revenue totals too high. Another common error is using COUNT(DISTINCT user_id) as a bandage without understanding whether the numerator and denominator are now measured at inconsistent grains. In interviews, this reads as metric hacking rather than analytical rigor.