SQL Joins And Aggregations
Asked of: Data Scientist
Last updated

What's being tested — 2-3 sentences. Candidates must show correct use of join types and aggregation to produce accurate, de-duplicated metrics across tables. Interviewers look for reasoning about join direction, where to filter, grouping keys, COUNT vs COUNT(DISTINCT), and performance implications on real datasets.
Core knowledge
- INNER vs LEFT/RIGHT/FULL JOIN semantics and when NULLs appear on non-matching side.
- Many-to-many joins cause row explosion; pre-aggregate or dedupe before joining.
- WHERE filters apply after join; moving predicates into join condition preserves LEFT JOIN rows.
- GROUP BY defines result granularity; every non-aggregated column must be grouped.
- COUNT(*) counts rows, COUNT(col) ignores NULLs, COUNT(DISTINCT col) is expensive.
- Use window functions (ROW_NUMBER, SUM OVER) for per-row ranks and running aggregates.
- Approximate distinct (HyperLogLog / APPROX_COUNT_DISTINCT) for large-scale distinct counts.
Worked example Title: "Compute monthly active users (MAU) by platform" Frame it by first clarifying table schemas (events: user_id, ts, platform; users: user_id, signup_ts). Ask whether platform is in events or users and the time window semantics. Decide deduplication: define MAU as distinct user_id with any event in month — so aggregate events to one row per user-month (COUNT DISTINCT user_id or pre-aggregate by user_id, month). Choose JOIN only if enriching with user attributes; prefer LEFT JOIN users → events if you must keep users with zero events. Explain tradeoffs: COUNT(DISTINCT) vs pre-group + join, and whether to use approximate distinct for scale.
A common pitfall The tempting but wrong approach is joining raw events to users and then COUNT(*) grouped by user and month. That overcounts when events are multiple per user, or when enriching with tables that multiply rows (e.g., user -> device many-to-many). Also mistakenly placing a filter like WHERE device IS NOT NULL after a LEFT JOIN converts it effectively to an INNER JOIN, dropping desired rows. Always reason about cardinality, dedupe before aggregating, and push selective predicates appropriately.
Further reading
- PostgreSQL documentation: JOINs and query planning (postgresql.org/docs).
- Google BigQuery docs: APPROX_COUNT_DISTINCT and COUNT(DISTINCT) tradeoffs.