Design an idempotent churn ETL pipeline

Q: What difficulty level is this coding question?

This is a medium-difficulty Data Manipulation (SQL/Python) question, commonly asked during HR Screen rounds at Intuit.

Q: What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Intuit during technical interviews.

Question

You must build a daily pipeline that produces month-end churn metrics (logo churn, gross revenue churn, net revenue retention) from streaming subscription events with late arrivals (up to T+3 days). Requirements: idempotent runs, backfills for the past 12 months, slowly changing dimensions (SCD Type 2) for plan changes, data quality checks, and reproducibility.

1) Outline the DAG (tasks, dependencies) from raw events to curated snapshots to metrics. Specify partitioning, clustering, and keys.
2) Provide SQL for an upsert/merge that constructs a monthly snapshot table from event-level changes, correctly handling late-arriving cancels/reactivations and preventing double-application on reruns.
3) Describe a strategy to recompute only affected months when late data arrives (e.g., incremental backfill windows, watermarks) and how you’d validate the recomputation via invariants.
4) Show how you’d version metric definitions so historical reports remain interpretable when definitions evolve (e.g., semantic layer with versioned views).
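As a starting point for part 2, here is a minimal sketch of an idempotent snapshot build, using Python's built-in sqlite3 so it runs anywhere. The table names (events, monthly_snapshot), the columns, and the helper functions are all assumptions made for illustration, not part of the question. SQLite's INSERT OR IGNORE and INSERT OR REPLACE stand in for a warehouse MERGE statement. The idea is that raw events are deduplicated on event_id, and each month-end row is derived from the full event history up to the cutoff, so a rerun or a late-arriving cancel overwrites the row instead of being applied a second time.

```python
import sqlite3

# Hypothetical schema: one row per subscription event, one row per account per month end.
SCHEMA = """
CREATE TABLE IF NOT EXISTS events (
    event_id   TEXT PRIMARY KEY,   -- dedup key: a replayed event is ignored
    account_id TEXT NOT NULL,
    event_date TEXT NOT NULL,      -- ISO date when the change took effect
    status     TEXT NOT NULL,      -- 'active' or 'cancelled'
    mrr        REAL NOT NULL
);
CREATE TABLE IF NOT EXISTS monthly_snapshot (
    month_end  TEXT NOT NULL,
    account_id TEXT NOT NULL,
    status     TEXT NOT NULL,
    mrr        REAL NOT NULL,
    PRIMARY KEY (month_end, account_id)   -- merge key for the upsert
);
"""


def ingest(conn, events):
    """Load raw events; rows whose event_id already exists are skipped."""
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO events VALUES (?, ?, ?, ?, ?)", events
        )


def rebuild_snapshot(conn, month_end):
    """Upsert the month-end state of every account from its latest event.

    The row is recomputed from all events dated on or before month_end,
    so running this twice, or again after late data lands, gives the
    same result as running it once.
    """
    with conn:
        conn.execute(
            """
            INSERT OR REPLACE INTO monthly_snapshot
                (month_end, account_id, status, mrr)
            SELECT :month_end, e.account_id, e.status, e.mrr
            FROM events AS e
            WHERE e.event_id = (
                SELECT e2.event_id
                FROM events AS e2
                WHERE e2.account_id = e.account_id
                  AND e2.event_date <= :month_end
                -- event_id breaks ties so the pick is deterministic
                ORDER BY e2.event_date DESC, e2.event_id DESC
                LIMIT 1
            )
            """,
            {"month_end": month_end},
        )
```

A typical run would call ingest with the day's batch and then rebuild_snapshot for each month end that the batch touches. In a real warehouse the same shape would likely be a MERGE keyed on (month_end, account_id), with the snapshot table partitioned by month_end.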

PracHub · Accepted Answer

This question evaluates a candidate's competence in designing robust ETL pipelines, covering idempotent processing, handling late-arriving streaming subscription events, SCD Type 2 slowly changing dimensions, incremental backfills, data quality checks, upsert/merge semantics, and metric versioning.
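The incremental backfill idea can be illustrated with a short sketch that decides which months to recompute when late events arrive. It assumes a conservative rule that is not stated in the question: a late event dated in month M can change the month-end state of M and of every later month, so everything from the earliest affected month through the current month is rebuilt, capped at the 12-month backfill window. The function name and parameters are hypothetical.

```python
from datetime import date


def _month_index(d):
    """Map a date to a running month number so months can be compared and ranged."""
    return d.year * 12 + d.month - 1


def months_to_recompute(late_event_dates, run_date, max_lookback_months=12):
    """Return 'YYYY-MM' labels for the months whose snapshots should be rebuilt.

    late_event_dates: effective dates of events that arrived since the last run.
    run_date: date of the current pipeline run.
    max_lookback_months: backfill cap (assumed to be 12, as in the requirements).
    """
    if not late_event_dates:
        return []
    last = _month_index(run_date)
    oldest_allowed = last - max_lookback_months + 1
    first = max(_month_index(min(late_event_dates)), oldest_allowed)
    return [f"{i // 12:04d}-{i % 12 + 1:02d}" for i in range(first, last + 1)]
```

For example, a cancel dated 2024-01-30 that lands during a run on 2024-03-02 would mark January, February, and March 2024 for recomputation. Events older than the lookback cap are clamped to the window rather than dropped, which is a design choice worth calling out in an interview answer.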

Quick Overview
