Cohorts And Window Functions In SQL — Tech Interview Concept

What's being tested
Ability to design and compute cohort retention and behaviors using SQL window (analytic) functions, plus practical choices: cohort definition, deduplication, time bucketing, and performance tradeoffs on large event tables.

Core knowledge

PARTITION BY + ORDER BY define the window; frame clauses (ROWS/RANGE) control rows included.
ROW_NUMBER/DENSE_RANK to pick first event; MIN(...) OVER(PARTITION BY ...) for cohort_date.
LAG/LEAD for gaps/sessionization; use SUM(...) OVER to assign session ids from gap flags.
Cumulative metrics: SUM(...) OVER(ORDER BY ... ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW).
Time functions: DATE_TRUNC/TIMESTAMP_TRUNC, TIMESTAMP_DIFF/DATE_DIFF to compute offsets.
Pivoting retention: conditional aggregation (COUNTIF/SUM(CASE WHEN ...)) after dedupe.
Performance: window functions sort data; cluster/partition tables (BigQuery clustering, Postgres indexes) and pre-aggregate to scale.

Worked example — "Build weekly user retention cohorts" (framing)
First define cohort_start as each user's first event week: cohort_week = DATE_TRUNC('week', MIN(event_ts) OVER (PARTITION BY user_id)). Next, assign each event to an event_week and compute week_offset = DATE_DIFF(event_week, cohort_week, WEEK). Deduplicate per user-week (keep first event per user-week) or aggregate unique users. Finally aggregate counts per (cohort_week, week_offset) and pivot to a retention matrix. Before writing final SQL, call out timezone handling, partial weeks at edges, and whether retention counts should be unique users or total events.

A common pitfall
A tempting but wrong approach is filtering the event table to a date range first, then computing cohort_start — this loses users whose true first event is outside the range and skews cohort assignment. Another common error is using COUNT(*) for retention instead of deduplicated user counts, double-counting multiple events per user in a period. Also watch RANGE vs ROWS with timestamps: RANGE can behave unexpectedly with non-integer ORDER BY types.

Further reading

PostgreSQL docs: Window Functions — https://www.postgresql.org/docs/current/functions-window.html
BigQuery Analytic Functions — https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-functions

Related concepts