You must build a daily pipeline that produces month-end churn metrics (logo churn, gross revenue churn, net revenue retention) from streaming subscription events with late arrivals (up to T+3 days). Requirements: idempotent runs, backfills for the past 12 months, slowly changing dimensions (SCD Type 2) for plan changes, data quality checks, and reproducibility.
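The three metrics are not defined in the prompt, so it helps to pin down working definitions before designing the pipeline. The sketch below assumes commonly used SaaS definitions measured against the cohort active at the start of the month; the `MonthlyMovement` type and its field names are hypothetical, and a real answer should confirm the definitions with stakeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonthlyMovement:
    """Hypothetical month-level aggregates for the start-of-month cohort."""
    starting_customers: int   # customers active on the first day of the month
    churned_customers: int    # of those, customers fully cancelled by month end
    starting_mrr: float       # recurring revenue from the starting cohort
    churned_mrr: float        # revenue lost to full cancellations
    contraction_mrr: float    # revenue lost to downgrades
    expansion_mrr: float      # revenue gained from upgrades in the same cohort


def churn_metrics(m: MonthlyMovement) -> dict:
    """Compute the three metrics under the assumed definitions."""
    if m.starting_customers <= 0 or m.starting_mrr <= 0:
        raise ValueError("metrics are undefined for an empty starting cohort")
    lost_mrr = m.churned_mrr + m.contraction_mrr
    return {
        # Assumed: share of starting customers lost during the month.
        "logo_churn": m.churned_customers / m.starting_customers,
        # Assumed: lost revenue (cancels + downgrades) over starting revenue.
        "gross_revenue_churn": lost_mrr / m.starting_mrr,
        # Assumed: ending revenue of the starting cohort over starting revenue.
        "net_revenue_retention": (m.starting_mrr - lost_mrr + m.expansion_mrr)
        / m.starting_mrr,
    }
```

Under these definitions new-customer revenue is excluded from net revenue retention, which is why the inputs are scoped to the starting cohort.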
- Outline the DAG (tasks, dependencies) from raw events to curated snapshots to metrics. Specify partitioning, clustering, and keys.
- Provide SQL for an upsert/merge that constructs a monthly snapshot table from event-level changes, correctly handling late-arriving cancels/reactivations and preventing double-application on reruns.
- Describe a strategy to recompute only affected months when late data arrives (e.g., incremental backfill windows, watermarks) and how you’d validate the recomputation via invariants.
- Show how you’d version metric definitions so historical reports remain interpretable when definitions evolve (e.g., semantic layer with versioned views).
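For the late-data item above, one possible way to pick the recomputation window is to compare each event's ingestion time with a stored watermark and derive the months to rebuild from the earliest affected effective date. The following is a minimal sketch, not a prescribed solution: the `Event` type, its field names, and the rule that every month from the earliest affected month through the run month is rebuilt (because snapshot state carries forward) are all assumptions.

```python
from datetime import date, datetime
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    """Hypothetical subscription event; field names are illustrative."""
    event_id: str
    effective_date: date    # when the change took effect for the customer
    ingested_at: datetime   # when the pipeline received the event


def month_start(d: date) -> date:
    return d.replace(day=1)


def next_month(d: date) -> date:
    if d.month == 12:
        return date(d.year + 1, 1, 1)
    return date(d.year, d.month + 1, 1)


def affected_months(
    events: Iterable[Event], watermark: datetime, run_date: date
) -> List[date]:
    """Return first-of-month dates whose snapshots should be rebuilt.

    Only events ingested after the watermark count as new. Assumption: a
    change effective in month M can alter every snapshot from M through the
    run month, so the whole range is returned.
    """
    new_events = [e for e in events if e.ingested_at > watermark]
    if not new_events:
        return []
    current = month_start(min(e.effective_date for e in new_events))
    last = month_start(run_date)
    months = []
    while current <= last:
        months.append(current)
        current = next_month(current)
    return months
```

Because the result depends only on the events, the watermark, and the run date, rerunning with the same inputs yields the same window, which fits the idempotency requirement. With the stated T+3 lateness bound, the window would normally span at most the previous and current month.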