Cohorts And Window Functions In SQL
Asked of: Data Scientist
Last updated

What's being tested
Ability to design and compute cohort retention and behaviors using SQL window (analytic) functions, plus practical choices: cohort definition, deduplication, time bucketing, and performance tradeoffs on large event tables.
Core knowledge
- PARTITION BY + ORDER BY define the window; frame clauses (ROWS/RANGE) control rows included.
- ROW_NUMBER/DENSE_RANK to pick first event; MIN(...) OVER(PARTITION BY ...) for cohort_date.
- LAG/LEAD for gaps/sessionization; use SUM(...) OVER to assign session ids from gap flags.
- Cumulative metrics: SUM(...) OVER(ORDER BY ... ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW).
- Time functions: DATE_TRUNC/TIMESTAMP_TRUNC, TIMESTAMP_DIFF/DATE_DIFF to compute offsets.
- Pivoting retention: conditional aggregation (COUNTIF/SUM(CASE WHEN ...)) after dedupe.
- Performance: window functions sort data; cluster/partition tables (BigQuery clustering, Postgres indexes) and pre-aggregate to scale.
Worked example — "Build weekly user retention cohorts" (framing)
First define cohort_start as each user's first event week: cohort_week = DATE_TRUNC('week', MIN(event_ts) OVER (PARTITION BY user_id)). Next, assign each event to an event_week and compute week_offset = DATE_DIFF(event_week, cohort_week, WEEK). Deduplicate per user-week (keep first event per user-week) or aggregate unique users. Finally aggregate counts per (cohort_week, week_offset) and pivot to a retention matrix. Before writing final SQL, call out timezone handling, partial weeks at edges, and whether retention counts should be unique users or total events.
A common pitfall
A tempting but wrong approach is filtering the event table to a date range first, then computing cohort_start — this loses users whose true first event is outside the range and skews cohort assignment. Another common error is using COUNT(*) for retention instead of deduplicated user counts, double-counting multiple events per user in a period. Also watch RANGE vs ROWS with timestamps: RANGE can behave unexpectedly with non-integer ORDER BY types.
Further reading
- PostgreSQL docs: Window Functions — https://www.postgresql.org/docs/current/functions-window.html
- BigQuery Analytic Functions — https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-functions
Related concepts
- Window Functions, Cohorting, and Time Series SQL
- SQL Joins, Aggregation, And Window Functions
- SQL Joins, Aggregations, And Window Functions
- SQL Window Functions And Temporal JoinsData Manipulation (SQL/Python)
- SQL Window Functions And AnalyticsData Manipulation (SQL/Python)
- Window FunctionsData Manipulation (SQL/Python)