Design an analytic warehouse for event data
Company: Snowflake
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Onsite
Design a warehouse-ready analytics data model and ingestion plan to support cohort retention, ARPU, and product usage analyses at scale (50M events/day). Assume BigQuery or Snowflake.
Sub-questions:
1) Schema: Propose fact and dimension tables (e.g., fact_events, fact_orders, dim_user, dim_country). Provide DDL-level details: partitioning (by event_date), clustering/sorting keys (user_id, event_type), and surrogate keys. Explain how you’d model slowly changing dimensions (Type 2) for user country and app version.
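A minimal DDL sketch of the shape an answer could take, in Snowflake-flavored SQL (table and column names are illustrative, not prescribed by the question; BigQuery equivalents noted in comments):

```sql
-- fact_events: one row per raw event, keyed for dedupe and pruning.
CREATE TABLE fact_events (
    event_sk    NUMBER IDENTITY,          -- surrogate key
    user_id     NUMBER NOT NULL,
    event_type  VARCHAR NOT NULL,
    event_ts    TIMESTAMP_NTZ NOT NULL,
    event_date  DATE NOT NULL,            -- derived from event_ts; prune on this
    payload     VARIANT
)
CLUSTER BY (event_date, user_id);
-- BigQuery analogue: PARTITION BY event_date CLUSTER BY user_id, event_type

-- dim_user: SCD Type 2 for country and app_version; facts join on user_sk
-- as of event_ts (event_ts BETWEEN valid_from AND COALESCE(valid_to, ...)).
CREATE TABLE dim_user (
    user_sk     NUMBER IDENTITY,          -- surrogate key
    user_id     NUMBER NOT NULL,          -- natural key
    country     VARCHAR,
    app_version VARCHAR,
    valid_from  TIMESTAMP_NTZ NOT NULL,   -- start of this version's validity
    valid_to    TIMESTAMP_NTZ,            -- NULL while current
    is_current  BOOLEAN DEFAULT TRUE
);
```

On a country or app-version change, the current row is closed (valid_to set, is_current = FALSE) and a new row is inserted, so historical facts keep their point-in-time attributes.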
2) Idempotency & deduplication: Events can arrive late (up to 14 days) and out-of-order with occasional duplicates (same (user_id, event_ts, event_type)). Specify your dedupe key and merge/upsert strategy. How do you reprocess late data without double-counting cohorts? Include a backfill plan.
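A runnable sketch of the dedupe-then-merge pattern, using stdlib sqlite3 to stand in for the warehouse (Snowflake/BigQuery would use MERGE; sqlite's ON CONFLICT upsert is the analogue, and all table names here are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE staging_events (user_id INT, event_ts TEXT, event_type TEXT, ingest_ts TEXT);
CREATE TABLE fact_events (user_id INT, event_ts TEXT, event_type TEXT, ingest_ts TEXT,
                          PRIMARY KEY (user_id, event_ts, event_type));
""")

# Staged batch contains an in-batch duplicate plus a late re-delivery of a
# row already present in the fact table.
con.executemany("INSERT INTO staging_events VALUES (?,?,?,?)", [
    (1, "2024-01-01T10:00:00", "click", "2024-01-01T10:01:00"),
    (1, "2024-01-01T10:00:00", "click", "2024-01-01T10:05:00"),  # duplicate
    (2, "2024-01-02T09:00:00", "view",  "2024-01-02T09:01:00"),  # re-delivery
])
con.execute("INSERT INTO fact_events VALUES (2, '2024-01-02T09:00:00', 'view', '2024-01-02T09:01:00')")

# Dedupe within the batch on (user_id, event_ts, event_type), keeping the
# latest ingest_ts, then upsert so re-delivered rows never double-count.
con.execute("""
INSERT INTO fact_events
SELECT user_id, event_ts, event_type, ingest_ts
FROM (
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY user_id, event_ts, event_type
        ORDER BY ingest_ts DESC) AS rn
    FROM staging_events
)
WHERE rn = 1
ON CONFLICT (user_id, event_ts, event_type) DO UPDATE SET ingest_ts = excluded.ingest_ts
""")

print(con.execute("SELECT COUNT(*) FROM fact_events").fetchone()[0])
```

Because the merge is keyed on the dedupe key, replaying any 14-day window of staging data (the backfill) is idempotent: late rows insert, re-delivered rows update in place, and downstream cohort counts stay stable.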
3) Metrics tables: Define a derived table grain for weekly retention by signup_week and ARPU by cohort. Show the SQL pattern (window PARTITION BY and date bucketing) you’d use and how materialization/incremental build works. Describe data quality checks (e.g., cohort_size non-increasing across week_index, retention ∈ [0,1]).
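A sketch of the retention table at the (signup_week, week_index) grain, again with stdlib sqlite3 and integer week numbers standing in for date_trunc bucketing (table names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE signups  (user_id INT, signup_week INT);
CREATE TABLE activity (user_id INT, activity_week INT);
""")
con.executemany("INSERT INTO signups VALUES (?,?)", [(1, 0), (2, 0), (3, 1)])
con.executemany("INSERT INTO activity VALUES (?,?)",
                [(1, 0), (2, 0), (1, 1), (3, 1), (3, 2)])

# Grain: one row per (signup_week, week_index). cohort_size is fixed per
# cohort, so retention is bounded in [0, 1] by construction.
rows = con.execute("""
SELECT s.signup_week,
       a.activity_week - s.signup_week            AS week_index,
       COUNT(DISTINCT a.user_id) * 1.0
         / MAX(c.cohort_size)                     AS retention
FROM signups s
JOIN activity a ON a.user_id = s.user_id
               AND a.activity_week >= s.signup_week
JOIN (SELECT signup_week, COUNT(*) AS cohort_size
      FROM signups GROUP BY signup_week) c
  ON c.signup_week = s.signup_week
GROUP BY s.signup_week, week_index
ORDER BY s.signup_week, week_index
""").fetchall()

for signup_week, week_index, retention in rows:
    print(signup_week, week_index, round(retention, 2))
```

An incremental build would recompute only the signup_week partitions touched by late-arriving activity; the data quality checks (retention in [0, 1], distinct active users per cohort never exceeding cohort_size) run as assertions over this derived table.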
4) Performance: Estimate table sizes, choose file sizes/micro-partitions, and justify cluster keys for typical queries (top-N countries last 8 weeks, rolling DAU/WAU/MAU). Discuss cost controls (partition pruning, approximate distinct with HLL, result caching).
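A back-of-envelope sizing an answer might walk through; the ~200 compressed bytes per event is an assumption, not given in the question:

```python
events_per_day = 50_000_000
bytes_per_event = 200                      # assumed compressed size on disk

daily_gb = events_per_day * bytes_per_event / 1e9   # 10 GB/day
yearly_tb = daily_gb * 365 / 1e3                    # ~3.7 TB/year

# A "top-N countries, last 8 weeks" query with partition pruning on
# event_date scans ~56 days of data rather than the whole table.
scanned_gb = daily_gb * 56

print(f"{daily_gb:.0f} GB/day, {yearly_tb:.2f} TB/year, "
      f"{scanned_gb:.0f} GB scanned for an 8-week query")
```

At these sizes the pruned scan is the dominant cost lever; approximate distinct counts (HLL) and result caching then shave the remaining compute on repeated DAU/WAU/MAU rollups.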
Quick Answer: This question evaluates a candidate's competency in data warehousing and engineering for analytics: event-level schema design and partitioning, slowly changing dimensions, idempotent ingestion and deduplication, derived metrics modeling, and query performance/cost trade-offs using SQL/Python.