Design an idempotent churn ETL pipeline
Company: Intuit
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: HR Screen
You must build a daily pipeline that produces month-end churn metrics (logo churn, gross revenue churn, net revenue retention) from streaming subscription events, some of which arrive late (up to T+3 days). Requirements: idempotent runs, backfills for the past 12 months, Type 2 slowly changing dimensions (SCD Type 2) for plan changes, data quality checks, and reproducibility.
1) Outline the DAG (tasks, dependencies) from raw events to curated snapshots to metrics. Specify partitioning, clustering, and keys.
2) Provide SQL for an upsert/merge that constructs a monthly snapshot table from event-level changes, correctly handling late-arriving cancels/reactivations and preventing double-application on reruns.
3) Describe a strategy to recompute only affected months when late data arrives (e.g., incremental backfill windows, watermarks) and how you’d validate the recomputation via invariants.
4) Show how you’d version metric definitions so historical reports remain interpretable when definitions evolve (e.g., semantic layer with versioned views).
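For part 3, one possible approach is to keep a watermark on ingestion time and recompute only the months touched by events that landed after it. The sketch below is illustrative, not a reference answer: the `Event` type, its field names (`event_date`, `ingested_at`), and both helper functions are hypothetical. It also shows one candidate invariant for validating a recomputed month, namely that active-subscription counts roll forward (opening + new + reactivated - churned = closing).

```python
from datetime import date, datetime
from typing import Iterable, NamedTuple


class Event(NamedTuple):
    subscription_id: str
    event_date: date        # business-effective date of the change
    ingested_at: datetime   # when the row landed in the raw layer


def affected_months(events: Iterable[Event], watermark: datetime) -> list[date]:
    """Return the first day of each month touched by events ingested after the watermark."""
    return sorted(
        {e.event_date.replace(day=1) for e in events if e.ingested_at > watermark}
    )


def roll_forward_holds(opening: int, new: int, reactivated: int,
                       churned: int, closing: int) -> bool:
    """Invariant: active counts must reconcile from month start to month end."""
    return opening + new + reactivated - churned == closing


events = [
    Event("s1", date(2024, 1, 30), datetime(2024, 1, 30, 12, 0)),  # on time
    Event("s2", date(2024, 1, 31), datetime(2024, 2, 2, 9, 0)),    # late, effective in January
    Event("s3", date(2024, 2, 1), datetime(2024, 2, 2, 10, 0)),    # on time for February
]
last_run_watermark = datetime(2024, 2, 1, 0, 0)
months_to_recompute = affected_months(events, last_run_watermark)
```

With a T+3 lateness bound, a daily run should normally touch at most the current and previous month. A month whose opening balance depends on a recomputed month may also need refreshing, so in practice the window is often extended forward from the earliest affected month.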
Quick Answer: This question evaluates a candidate's ability to design robust ETL pipelines. It covers idempotent processing, late-arriving streaming subscription events, Type 2 slowly changing dimensions (SCD Type 2), incremental backfills, data quality checks, upsert/merge semantics, and metric versioning.
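To make the idempotency requirement concrete, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for a warehouse. The table names, columns, and event types are assumptions for illustration only, and the SQL relies on SQLite's `ON CONFLICT ... DO UPDATE` and window functions (SQLite 3.25 or newer). The key idea is that each snapshot row is derived from the full, deduplicated event history up to month-end rather than by applying deltas, so reruns and late arrivals overwrite state instead of double-applying it.

```python
import sqlite3

SCHEMA = """
CREATE TABLE events (
    event_id        TEXT PRIMARY KEY,   -- dedupe key for replayed events
    subscription_id TEXT NOT NULL,
    event_type      TEXT NOT NULL,      -- 'activate' | 'cancel' | 'reactivate'
    event_date      TEXT NOT NULL,      -- ISO date, business-effective
    mrr             REAL NOT NULL
);
CREATE TABLE monthly_snapshot (
    snapshot_month  TEXT NOT NULL,      -- 'YYYY-MM'
    subscription_id TEXT NOT NULL,
    is_active       INTEGER NOT NULL,
    mrr             REAL NOT NULL,
    PRIMARY KEY (snapshot_month, subscription_id)
);
"""

# State at month-end = latest event on or before month-end, per subscription.
UPSERT_SNAPSHOT = """
INSERT INTO monthly_snapshot (snapshot_month, subscription_id, is_active, mrr)
SELECT ?, subscription_id,
       CASE WHEN event_type = 'cancel' THEN 0 ELSE 1 END,
       CASE WHEN event_type = 'cancel' THEN 0.0 ELSE mrr END
FROM (
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY subscription_id
        ORDER BY event_date DESC, event_id DESC
    ) AS rn
    FROM events
    WHERE event_date <= ?
) AS latest
WHERE rn = 1
ON CONFLICT (snapshot_month, subscription_id)
DO UPDATE SET is_active = excluded.is_active, mrr = excluded.mrr
"""


def load_events(conn, rows):
    # INSERT OR IGNORE makes replays of the same event_id a no-op.
    conn.executemany("INSERT OR IGNORE INTO events VALUES (?, ?, ?, ?, ?)", rows)


def build_snapshot(conn, snapshot_month, month_end):
    conn.execute(UPSERT_SNAPSHOT, (snapshot_month, month_end))


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

load_events(conn, [("e1", "s1", "activate", "2024-01-05", 50.0)])
build_snapshot(conn, "2024-01", "2024-01-31")
build_snapshot(conn, "2024-01", "2024-01-31")  # rerun leaves a single row

# A cancel effective 2024-01-28 arrives late and is delivered twice.
late = [("e2", "s1", "cancel", "2024-01-28", 0.0)]
load_events(conn, late)
load_events(conn, late)
build_snapshot(conn, "2024-01", "2024-01-31")

rows = conn.execute(
    "SELECT snapshot_month, subscription_id, is_active, mrr FROM monthly_snapshot"
).fetchall()
```

In a warehouse that offers a `MERGE` statement, the same pattern would typically be written as a `MERGE` keyed on (snapshot month, subscription id). This sketch does not cover SCD Type 2 plan history, which would need effective-from and effective-to columns on a separate dimension table.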