Rate your SQL proficiency across querying, modeling, and operations, then substantiate it with evidence: 1) Describe a complex query you authored (window functions, CTEs, conditional aggregation, or recursion). Include the business goal, data size, and why the final design was chosen over alternatives. 2) Provide a performance win you delivered: baseline runtime/cost, bottleneck diagnosis (planner/explain findings), concrete changes (indexes, rewrites, partitioning, statistics), and measurable after metrics. 3) Explain a safe deployment you led: code review standards, migration plan (DDL with backfill, feature flags), rollback strategy, and how you validated data correctness (row counts, checksums, invariants). 4) Share how you mentor or review others’ SQL: common anti-patterns you look for and a recent example where feedback improved reliability or speed.
Quick Answer: This question evaluates a data scientist's SQL proficiency across querying (window functions, CTEs, joins, aggregation), data modeling (normalization, dimensional modeling, slowly changing dimensions), and operations (performance tuning, deployments, observability). It also probes behavioral and leadership competencies such as mentorship, code review, and safe deployment practice. Interviewers ask it to confirm hands-on experience with analytical workloads and production safety: concrete query-writing skill plus conceptual command of performance tuning, migration strategy, and review process.
Solution
# Self‑Rating
- Querying: Expert (9/10). Daily use of window functions, CTE pipelines, conditional aggregation, lateral joins; comfortable with tricky edge cases (islands and gaps, sargability, null semantics).
- Data Modeling: Strong (8/10). Star/snowflake schemas for analytics, SCD types 1/2, snapshotting, surrogate keys, and constraints; pragmatic denormalization for read performance.
- Operations: Strong (8/10). EXPLAIN‑driven tuning, indexing, partitioning/clustering, statistics management; safe migrations (expand/migrate/contract), data validation, and rollback plans.
# 1) Complex Query Example — Rolling 90‑Day Cure Rate After 60+ Delinquency
Context: Compute quarterly cohort metrics for loans: when a loan first hits 60+ days delinquent in a quarter, did it cure (return to “current”) within 90 days? Results feed risk monitoring and provisioning.
- Data: loan_status_daily (~900M rows over 5 years; columns: loan_id, as_of_date, status in {current, D30, D60, D90, foreclosure}); loans_dim (~12M loans; origination_date, product_type, region).
- Platform: PostgreSQL 13 (partitioned by as_of_date, clustered by (loan_id, as_of_date)).
- Design choices: window functions + CTEs for “islands and gaps” over recursion/correlated subqueries. Chosen for linear scans, partition pruning by date, and planner‑friendly operations. Recursion worked in prototypes but was slower and harder to reason about at scale.
Small illustrative data (per loan):
- For a given loan: current, current, D30, D60, D60, D90, current, current...
- First 60+ day event is first row where status ∈ {D60, D90, foreclosure} and the previous day was not 60+.
- Cure date is the next row where status = current.
Core query sketch:
```sql
WITH daily AS (
SELECT
s.loan_id,
s.as_of_date,
s.status,
(s.status IN ('D60','D90','foreclosure')) AS is_d60p,
LAG(s.status IN ('D60','D90','foreclosure')) OVER (
PARTITION BY s.loan_id ORDER BY s.as_of_date
) AS prev_is_d60p
FROM loan_status_daily s
WHERE s.as_of_date >= DATE '2023-01-01'
AND s.as_of_date < DATE '2024-01-01' -- partition pruning
), with_next_current AS (
  SELECT
    loan_id,
    as_of_date,
    is_d60p,
    prev_is_d60p,
    /* next 'current' date on or after this row; computed here, BEFORE the
       start-row filter, so the window can still see the 'current' rows */
    MIN(CASE WHEN status = 'current' THEN as_of_date END) OVER (
      PARTITION BY loan_id
      ORDER BY as_of_date
      ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
    ) AS next_current_date
  FROM daily
), d60p_starts AS (
  SELECT
    loan_id,
    as_of_date AS d60p_start_date,
    next_current_date
  FROM with_next_current
  WHERE is_d60p AND NOT COALESCE(prev_is_d60p, FALSE)
), episodes AS (
SELECT
loan_id,
d60p_start_date,
next_current_date,
CASE
WHEN next_current_date IS NOT NULL
AND next_current_date <= d60p_start_date + INTERVAL '90 days'
THEN 1 ELSE 0 END AS cured_within_90
FROM d60p_starts
), cohort AS (
SELECT
e.loan_id,
DATE_TRUNC('quarter', l.origination_date) AS origination_qtr,
e.d60p_start_date,
e.cured_within_90
FROM episodes e
JOIN loans_dim l USING (loan_id)
)
SELECT
origination_qtr,
COUNT(*) AS d60p_episodes,
SUM(cured_within_90) AS cured_episodes,
ROUND(100.0 * SUM(cured_within_90)::numeric / NULLIF(COUNT(*),0), 2)
AS cure_rate_pct
FROM cohort
GROUP BY origination_qtr
ORDER BY origination_qtr;
```
Why this design
- Window MIN(...) to find the next current date is linear per loan and avoids N+1 lookups.
- CTEs improve readability; since Postgres 12 the planner inlines single‑reference, side‑effect‑free CTEs, so they no longer act as an optimization fence (add MATERIALIZED to force the old behavior).
- Partition pruning by date reduces scanned partitions; clustering by (loan_id, as_of_date) improves window locality.
- Alternatives considered: recursive CTE to build episodes (slower at scale); correlated subqueries (O(n^2) pattern on big tables).
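For context, a minimal sketch of the recursive alternative (hypothetical; assumes the daily CTE above is defined in the same WITH RECURSIVE clause):
```sql
-- Walks forward one day at a time from each episode start until the loan
-- cures or 90 days elapse. Illustrates why it loses to the window version:
-- one join per loan per day instead of a single ordered scan.
WITH RECURSIVE walk AS (
  SELECT loan_id, as_of_date AS d60p_start_date, as_of_date, status
  FROM daily
  WHERE is_d60p AND NOT COALESCE(prev_is_d60p, FALSE)
  UNION ALL
  SELECT w.loan_id, w.d60p_start_date, d.as_of_date, d.status
  FROM walk w
  JOIN daily d
    ON d.loan_id = w.loan_id
   AND d.as_of_date = w.as_of_date + 1
  WHERE w.status <> 'current'                    -- stop once cured
    AND d.as_of_date <= w.d60p_start_date + 90   -- bound the walk
)
SELECT loan_id, d60p_start_date,
       MIN(as_of_date) FILTER (WHERE status = 'current') AS next_current_date
FROM walk
GROUP BY 1, 2;
```
Each recursion step extends every open episode by one day, so work grows with episode length; the window formulation covers the same ground in one ordered pass per loan.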
Pitfalls and guardrails
- Ensure the status history is daily and gap‑free per loan; if gaps exist, pad with a calendar spine (see the sketch after this list) or adjust the logic.
- Define “cure” precisely; the business may require N consecutive “current” days, in which case compute a rolling count of consecutive current days. Also scan ~90 days past the cohort window so late‑quarter episodes are not falsely counted as uncured.
- Validate with spot checks: sample 100 loans, manually trace timelines; verify cohort counts against historical reports.
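If the daily grid has holes, a minimal padding sketch (the date range and the carry‑forward rule are illustrative assumptions):
```sql
-- Hypothetical gap padding: join a generated calendar per loan, then carry
-- the last observed status forward with a counting-window trick.
WITH calendar AS (
  SELECT l.loan_id, d::date AS as_of_date
  FROM loans_dim l
  CROSS JOIN generate_series(DATE '2023-01-01', DATE '2023-12-31',
                             INTERVAL '1 day') AS d
), padded AS (
  SELECT c.loan_id, c.as_of_date, s.status,
         -- increments only on days that have a real status row
         COUNT(s.status) OVER (PARTITION BY c.loan_id
                               ORDER BY c.as_of_date) AS carry_grp
  FROM calendar c
  LEFT JOIN loan_status_daily s USING (loan_id, as_of_date)
)
SELECT loan_id, as_of_date,
       -- each carry_grp holds exactly one non-null status: the last observed
       MAX(status) OVER (PARTITION BY loan_id, carry_grp) AS status_filled
FROM padded;
```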
# 2) Performance Win — 10x Faster Monthly Exposure Rollup
Problem: A monthly exposure rollup joined loans to transactions to compute balances and flags. It ran in ~32 minutes and sometimes hit CPU/I/O limits.
- Platform: PostgreSQL 13; fact_transactions ~180M rows/month, partitioned by tran_ts; loans_dim ~12M rows.
- Baseline (EXPLAIN ANALYZE highlights):
- Nested Loop join from loans_dim to fact_transactions due to non‑sargable date predicate (CAST on column) and missing composite index.
- 48M rows removed by filter; high random I/O.
Original (simplified):
```sql
SELECT l.id, DATE_TRUNC('month', t.tran_ts) AS mth,
SUM(t.amount) AS principal
FROM loans_dim l
JOIN fact_transactions t
ON t.loan_id = l.id
WHERE DATE(t.tran_ts) BETWEEN DATE '2023-01-01' AND DATE '2023-12-31'
GROUP BY 1,2;
```
Diagnosis
- The WHERE clause applies DATE() to t.tran_ts, blocking index and partition pruning (non‑sargable).
- No composite index on (loan_id, tran_ts); planner chose Nested Loop.
Fixes
1) Make filter sargable and align with partitions:
```sql
WHERE t.tran_ts >= TIMESTAMP '2023-01-01'
AND t.tran_ts < TIMESTAMP '2024-01-01'
```
2) Add composite index and analyze stats:
```sql
-- Note: a partitioned parent can't take CREATE INDEX CONCURRENTLY directly;
-- index each partition CONCURRENTLY, then ATTACH (shown simplified here).
CREATE INDEX CONCURRENTLY idx_txn_loan_ts ON fact_transactions (loan_id, tran_ts);
ANALYZE fact_transactions;  -- refresh planner statistics
```
3) Pre‑aggregate to reduce data volume scanned:
```sql
WITH txn AS (
SELECT loan_id,
DATE_TRUNC('month', tran_ts) AS mth,
SUM(amount) AS principal
FROM fact_transactions
WHERE tran_ts >= '2023-01-01' AND tran_ts < '2024-01-01'
GROUP BY 1,2
)
SELECT l.id, t.mth, t.principal
FROM loans_dim l
JOIN txn t ON t.loan_id = l.id;
```
4) Verify plan switched to Hash Join + Bitmap/Index scans and that partitions were pruned by date.
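A quick way to confirm, sketched with illustrative literals:
```sql
-- Expect: Hash Join, scans limited to the 2023 partitions, and far fewer
-- shared buffer reads than the baseline plan showed.
EXPLAIN (ANALYZE, BUFFERS)
SELECT loan_id, DATE_TRUNC('month', tran_ts) AS mth, SUM(amount)
FROM fact_transactions
WHERE tran_ts >= '2023-01-01' AND tran_ts < '2024-01-01'
GROUP BY 1, 2;
```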
Results
- Runtime: 32:10 → 3:05 (≈10.4x faster).
- I/O: ~1.1B heap fetches → ~120M; CPU reduced proportionally.
- Stability: 99th percentile latency under SLA on peak days.
Guardrails
- Use CREATE INDEX CONCURRENTLY to avoid blocking writes; on partitioned tables apply it per partition, and remember it cannot run inside a transaction block.
- Confirm cardinality estimates improved after ANALYZE.
- Add a regression benchmark in CI to catch future non‑sargable predicates.
# 3) Safe Deployment — Adding risk_tier With Backfill and Zero Downtime
Change: Introduce risk_tier (derived from risk_score) on loans_dim for downstream models; must not break existing consumers.
Standards and review
- SQL style/lint (sqlfluff), naming conventions, explicit types, deterministic UDFs only.
- Unit checks in staging (row counts, NOT NULL/unique expectations via db‑tests), EXPLAIN review for new/changed queries.
Migration plan (expand → backfill → migrate → contract)
1) Expand
```sql
ALTER TABLE loans_dim ADD COLUMN risk_tier text;  -- nullable initially
ALTER TABLE loans_dim
  ADD CONSTRAINT risk_tier_allowed
  CHECK (risk_tier IN ('A','B','C','D')) NOT VALID;  -- defer the full-table scan
CREATE INDEX CONCURRENTLY idx_loans_risk_tier ON loans_dim (risk_tier);
```
2) Dual‑write and feature flag
- Application/ETL writes compute risk_tier from risk_score and write both fields.
- Readers access a view providing both old and new columns; consumers switch behind a flag.
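A compatibility view makes the switch transparent (view and column list are illustrative):
```sql
-- Hypothetical compatibility view: consumers read the view, and the feature
-- flag decides whether downstream logic uses risk_score or risk_tier.
CREATE OR REPLACE VIEW loans_dim_compat AS
SELECT id,
       risk_score,   -- existing field, still authoritative during rollout
       risk_tier     -- new field, nullable until the backfill completes
FROM loans_dim;
```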
3) Backfill in batches to avoid long locks
```sql
-- Batch by surrogate-key ranges (:lo/:hi supplied by the driving script)
-- so each UPDATE touches a bounded slice and holds row locks briefly.
WITH updated AS (
  UPDATE loans_dim
  SET risk_tier = CASE
        WHEN risk_score >= 760 THEN 'A'
        WHEN risk_score >= 700 THEN 'B'
        WHEN risk_score >= 640 THEN 'C'
        ELSE 'D' END
  WHERE id BETWEEN :lo AND :hi
    AND risk_tier IS NULL
  RETURNING 1
)
SELECT COUNT(*) AS rows_updated FROM updated;  -- per-batch progress metric
```
4) Validate constraint and make it enforcing
```sql
ALTER TABLE loans_dim VALIDATE CONSTRAINT risk_tier_allowed; -- online: brief SHARE UPDATE EXCLUSIVE lock, no read/write blocking
```
5) Migrate reads
- Flip the feature flag so downstream models read risk_tier.
- Monitor metrics and data checks.
6) Contract
- After stability window, remove dead code/columns if deprecation approved.
Rollback strategy
- Keep old column and dual‑write for one release; a quick rollback is a feature‑flag flip to read the old column.
- Revert migration script prepared; indexes dropped CONCURRENTLY if needed.
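A prepared revert script might look like this (run only if rolling back; statements assume the DDL from the expand step above):
```sql
-- Reverse order of the expand step; IF EXISTS keeps the script idempotent.
DROP INDEX CONCURRENTLY IF EXISTS idx_loans_risk_tier;
ALTER TABLE loans_dim DROP CONSTRAINT IF EXISTS risk_tier_allowed;
ALTER TABLE loans_dim DROP COLUMN IF EXISTS risk_tier;
```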
Data correctness validation
- Row counts: before/after table counts equal.
- Checksums: hash aggregates by primary key; compare pre/post samples.
```sql
SELECT md5(string_agg(id || ':' || COALESCE(risk_tier, ''), ','
                      ORDER BY id)) AS chunk_checksum
FROM loans_dim
WHERE id BETWEEN :lo AND :hi;
```
- Invariants: risk_tier ∈ {A,B,C,D}; NOT NULL after migration; distribution within expected ranges.
- Reconciliation: sample 1,000 loans, recompute risk_tier in code, diff with DB values.
- Canary: run downstream model in shadow mode and compare outputs.
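A minimal invariant check for this migration (expected result: zero):
```sql
-- Violations after the backfill must be exactly zero before the flag flips.
SELECT COUNT(*) AS violations
FROM loans_dim
WHERE risk_tier IS NULL
   OR risk_tier NOT IN ('A', 'B', 'C', 'D');
```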
# 4) Mentoring and Reviews
What I look for
- Correctness
- Join keys and cardinality; avoid accidental CROSS JOINs.
- Null semantics (COUNT(*) vs COUNT(col), COALESCE, outer joins).
- Window frames/order correctness; avoid unbounded frames when not intended.
- Performance
- Sargability (avoid functions on indexed columns: DATE(ts), LOWER(col)); see the expression‑index sketch after this list.
- Avoid SELECT * in production; project only needed columns.
- Replace DISTINCT “band‑aids” with correct de‑duplication at the source.
- Eliminate correlated subqueries in hot paths; prefer joins/pre‑aggregation.
- Be wary of CTE materialization costs (engine‑specific); reuse subqueries sensibly.
- Ensure predicates enable partition pruning and index usage.
- Maintainability
- Clear CTE names, comments for business logic, consistent naming and types.
- Deterministic logic and stable ordering in tests.
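For the sargability item above, a minimal before/after illustration (table and index names are hypothetical):
```sql
-- Non-sargable: LOWER() on the column hides any plain index on email.
--   SELECT * FROM users WHERE LOWER(email) = 'a@example.com';
-- An expression index matching the predicate restores an index scan.
CREATE INDEX CONCURRENTLY idx_users_email_lower ON users (LOWER(email));
```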
Recent coaching example
- Original task: Monthly customer retention by cohort. The query used two correlated subqueries per row to find first_purchase_date and prior month activity. Runtime ~35 minutes on ~80M rows; occasional timeouts.
- Changes I proposed:
1) Precompute first_purchase per customer once.
2) Replace correlated subqueries with a window function and a join.
3) Ensure sargable date filters and correct partition pruning.
4) Add a unique key test on (customer_id, month) to prevent duplicates instead of relying on DISTINCT.
Before (simplified):
```sql
SELECT c.customer_id, m.month,
EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.customer_id
AND DATE_TRUNC('month', o.created_at) = m.month  -- non-sargable, probed per row
) AS active
FROM customers c
JOIN months m ON m.month BETWEEN :start AND :end;
```
After (faster, correct):
```sql
WITH first_purchase AS (
SELECT customer_id, MIN(created_at) AS first_order_at
FROM orders
GROUP BY 1
), activity AS (
SELECT customer_id,
DATE_TRUNC('month', created_at)::date AS month,  -- date-typed to match months.month
COUNT(*) AS orders
FROM orders
WHERE created_at >= :start AND created_at < :end
GROUP BY 1,2
)
SELECT c.customer_id, m.month,
       COALESCE(a.orders, 0) > 0 AS active  -- unmatched LEFT JOIN rows become FALSE, not NULL
FROM customers c
CROSS JOIN months m
LEFT JOIN activity a USING (customer_id, month)
JOIN first_purchase fp ON fp.customer_id = c.customer_id
WHERE m.month >= DATE_TRUNC('month', fp.first_order_at);
```
Results: 35 min → 1 min 40 sec (≈21x faster), no timeouts; correctness improved (no duplicate rows), and logic aligned with product definitions.
Validation patterns I teach
- Row‑count reconciliation at each CTE step.
- Diff against a trusted source with SELECT ... EXCEPT SELECT ... (sketch after this list).
- Sample‑based manual traces for edge cases.
- Add EXPLAIN plans to PRs and capture before/after metrics.
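For the EXCEPT diff, a sketch with hypothetical table names; zero rows returned means the rewrite matches the trusted output exactly:
```sql
-- Two-way diff: rows only in the new result, plus rows only in the trusted one.
(SELECT customer_id, month, active FROM retention_new
 EXCEPT
 SELECT customer_id, month, active FROM retention_trusted)
UNION ALL
(SELECT customer_id, month, active FROM retention_trusted
 EXCEPT
 SELECT customer_id, month, active FROM retention_new);
```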