SQL Joins, Deduplication, and Event Attribution
Asked of: Data Scientist
Last updated

-
What it is SQL joins combine rows across tables on matching keys. Deduplication removes repeated records, usually keeping a single “best” row per entity. Event attribution assigns credit for an outcome (e.g., purchase) to the right upstream touchpoint(s) within defined rules and time windows.
-
Why interviewers ask about it Data scientists at product-growth companies (e.g., Meta) must ship correct metrics and experiments from messy logs. Most failures come from the data layer: bad joins that fan out rows, sloppy dedup, or attribution rules that leak credit. Showing mastery signals you can build trustworthy funnels, cohorts, and lift estimates at scale in BigQuery/Presto/Snowflake.
-
Core ideas to know
- Inner vs left/right/full/cross joins; choose based on whether you must keep unmatched rows.
- Prevent fanout: confirm join cardinality; pre-aggregate to one row per key before joining.
- Dedup with ROW_NUMBER() PARTITION BY keys ORDER BY a deterministic tiebreaker; filter rn = 1.
- Prefer QUALIFY (Snowflake/BigQuery) or a subquery/CTE to filter window functions cleanly.
- Define idempotent dedup keys: natural business key plus ingestion or event timestamp.
- Attribution needs explicit rules: window length, last/first touch priority, tie-breakers, fallbacks.
- Validate with row-count checks, anti-joins to find orphans, and before/after metric diffs.
-
A common pitfall Candidates write a left join to “keep everything,” then see metrics double because the right table had multiple matches per key. They reach for SELECT DISTINCT or GROUP BY to hide the symptom, not fix the one-to-many root cause. The right fix is to pre-aggregate or deduplicate the right table to one row per join key, then join. Another frequent miss is nondeterministic dedup (no stable ORDER BY), causing flaky attribution when two events tie within the window.
-
Further reading
- PostgreSQL 16 docs — Table expressions and JOIN syntax: authoritative reference on inner/outer/cross joins, USING/NATURAL forms, and join semantics. https://www.postgresql.org/docs/16/queries-table-expressions.html
- BigQuery docs — Deduplicate duplicate records with ROW_NUMBER(): canonical pattern for keeping one row per entity using window functions. https://cloud.google.com/bigquery/docs/nested-repeated#deduplicate
- Amazon Ads Multi‑Touch Attribution (arXiv, 2025) — modern, large‑scale MTA system design and modeling considerations from an industry team. https://arxiv.org/abs/2508.08209