Design a warehouse-ready analytics data model and ingestion plan to support cohort retention, ARPU, and product use-case analyses at scale (50M events/day). Assume BigQuery or Snowflake.
Sub-questions:
- Schema: Propose fact and dimension tables (e.g., fact_events, fact_orders, dim_user, dim_country). Provide DDL-level details: partitioning (by event_date), clustering/sorting keys (user_id, event_type), and surrogate keys. Explain how you’d model slowly changing dimensions (Type 2) for user country and app version (see the DDL sketch after this list).
- Idempotency & deduplication: Events can arrive late (up to 14 days) and out of order, with occasional duplicates (same (user_id, event_ts, event_type)). Specify your dedupe key and merge/upsert strategy. How do you reprocess late data without double-counting cohorts? Include a backfill plan (see the MERGE sketch after this list).
- Metrics tables: Define the grain of derived tables for weekly retention by signup_week and ARPU by cohort. Show the SQL pattern (window PARTITION BY and date bucketing) you’d use and how materialization/incremental builds work. Describe data quality checks (e.g., cohort_size non-increasing across week_index, retention ∈ [0,1]); see the retention sketch after this list.
- Performance: Estimate table sizes, choose file sizes/micro-partitions, and justify cluster keys for typical queries (top-N countries over the last 8 weeks, rolling DAU/WAU/MAU). Discuss cost controls (partition pruning, approximate distinct with HLL, result caching); see the pruning/HLL sketch after this list.
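
For the Schema bullet, a minimal BigQuery-flavored DDL sketch. Dataset, table, and column names (analytics.fact_events, analytics.dim_user, revenue_usd, payload) are illustrative assumptions, not a spec; the same shape maps to Snowflake with automatic micro-partitioning plus a CLUSTER BY key.

```sql
-- Daily-partitioned event fact; columns are assumptions for illustration.
CREATE TABLE IF NOT EXISTS analytics.fact_events (
  event_id     STRING    NOT NULL,  -- surrogate key: hash of the dedupe key (user_id, event_ts, event_type)
  user_id      STRING    NOT NULL,  -- natural key, kept for clustering and joins
  user_sk      INT64,               -- surrogate key of the dim_user version valid at event_ts
  event_ts     TIMESTAMP NOT NULL,
  event_date   DATE      NOT NULL,  -- DATE(event_ts); partition column
  event_type   STRING    NOT NULL,
  revenue_usd  NUMERIC,             -- populated for purchase events
  payload      JSON
)
PARTITION BY event_date             -- daily partitions: pruning and cheap 14-day backfills
CLUSTER BY user_id, event_type;     -- matches the most common filter/join columns

-- Type 2 slowly changing dimension for user attributes (country, app version).
CREATE TABLE IF NOT EXISTS analytics.dim_user (
  user_sk      INT64     NOT NULL,  -- surrogate key, one row per attribute version
  user_id      STRING    NOT NULL,  -- natural/business key
  country_code STRING,
  app_version  STRING,
  valid_from   TIMESTAMP NOT NULL,  -- version start
  valid_to     TIMESTAMP,           -- NULL for the current version
  is_current   BOOL      NOT NULL
)
CLUSTER BY user_id;                 -- small table; clustering on the join key is enough
```

Facts join to the dimension with a point-in-time condition (event_ts between valid_from and valid_to), so a user who changes country or app version keeps their history attributed correctly.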
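
For the idempotency bullet, a hedged MERGE sketch in BigQuery syntax. The staging table staging.events_batch and its ingested_at column are assumptions; the dedupe key is the natural key from the bullet, hashed into event_id. Restricting the ON clause to the 14-day late-arrival window keeps the merge from scanning the whole fact table.

```sql
-- Deduplicate within the batch, then insert only rows not already present.
MERGE analytics.fact_events AS t
USING (
  SELECT * EXCEPT (rn)
  FROM (
    SELECT
      TO_HEX(SHA256(CONCAT(
        user_id, '|', CAST(event_ts AS STRING), '|', event_type))) AS event_id,  -- dedupe key -> surrogate key
      user_id,
      event_ts,
      DATE(event_ts) AS event_date,
      event_type,
      revenue_usd,
      ROW_NUMBER() OVER (
        PARTITION BY user_id, event_ts, event_type
        ORDER BY ingested_at DESC) AS rn          -- keep the newest copy of in-batch duplicates
    FROM staging.events_batch                     -- assumed landing table
  )
  WHERE rn = 1
) AS s
ON  t.event_id = s.event_id
AND t.event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 14 DAY)   -- prune the target scan to the late window
WHEN NOT MATCHED THEN
  INSERT (event_id, user_id, event_ts, event_date, event_type, revenue_usd)
  VALUES (s.event_id, s.user_id, s.event_ts, s.event_date, s.event_type, s.revenue_usd);
```

The window filter relies on the stated 14-day late-arrival bound; a backfill beyond it re-runs the same statement with an explicit date range on the target, then rebuilds the downstream cohort tables for the signup weeks those dates touch, so cohorts are recomputed rather than incremented twice.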
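
For the metrics bullet, a weekly-retention sketch at (signup_week, week_index) grain, BigQuery syntax. Signup is approximated as the user's first observed event, which is an assumption; an incremental build would MERGE only the signup weeks that can still change (those inside the 14-day window).

```sql
-- Grain of the derived table: one row per (signup_week, week_index).
WITH cohorts AS (                        -- first-seen week per user (proxy for signup)
  SELECT
    user_id,
    DATE_TRUNC(MIN(event_date), WEEK(MONDAY)) AS signup_week
  FROM analytics.fact_events
  GROUP BY user_id
),
activity AS (                            -- weekly activity buckets
  SELECT DISTINCT
    user_id,
    DATE_TRUNC(event_date, WEEK(MONDAY)) AS activity_week
  FROM analytics.fact_events
),
retained AS (
  SELECT
    c.signup_week,
    DATE_DIFF(a.activity_week, c.signup_week, WEEK) AS week_index,
    COUNT(DISTINCT a.user_id) AS retained_users
  FROM cohorts c
  JOIN activity a USING (user_id)
  GROUP BY 1, 2
)
SELECT
  signup_week,
  week_index,
  retained_users,
  FIRST_VALUE(retained_users) OVER (
    PARTITION BY signup_week ORDER BY week_index) AS cohort_size,    -- week 0 row = full cohort
  SAFE_DIVIDE(
    retained_users,
    FIRST_VALUE(retained_users) OVER (PARTITION BY signup_week ORDER BY week_index)
  ) AS retention                                                      -- should land in [0, 1]
FROM retained;
```

ARPU by cohort follows the same pattern: SUM(revenue_usd) grouped by (signup_week, week_index), divided by cohort_size. The quality checks from the bullet run against this output (retention within [0, 1], cohort_size consistent within each signup_week).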
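
For the performance bullet, two hedged cost-control sketches in BigQuery syntax (analytics.daily_user_sketch is an assumed rollup table). The first answers "top-N countries over the last 8 weeks" with partition pruning plus APPROX_COUNT_DISTINCT; the second pre-aggregates daily HLL sketches so rolling DAU/WAU/MAU merge sketches instead of rescanning raw events.

```sql
-- Top-10 countries by approximate distinct users over the last 8 weeks.
SELECT
  d.country_code,
  APPROX_COUNT_DISTINCT(f.user_id) AS approx_users               -- HLL-based, far cheaper than exact COUNT(DISTINCT)
FROM analytics.fact_events AS f
JOIN analytics.dim_user AS d
  ON  d.user_id = f.user_id
  AND f.event_ts >= d.valid_from                                  -- SCD2 point-in-time join:
  AND f.event_ts <  COALESCE(d.valid_to, TIMESTAMP '9999-12-31')  -- pick the version valid at event time
WHERE f.event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 8 WEEK)   -- partition pruning: ~56 daily partitions scanned
GROUP BY d.country_code
ORDER BY approx_users DESC
LIMIT 10;

-- Daily HLL sketches, merged at query time for rolling WAU (same idea for DAU/MAU).
CREATE TABLE IF NOT EXISTS analytics.daily_user_sketch
PARTITION BY event_date AS
SELECT event_date, HLL_COUNT.INIT(user_id) AS user_sketch
FROM analytics.fact_events
GROUP BY event_date;

SELECT HLL_COUNT.MERGE(user_sketch) AS wau
FROM analytics.daily_user_sketch
WHERE event_date > DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY);
```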