Design a warehouse-ready analytics data model and ingestion plan to support cohort retention, ARPU, and product use-case analyses at scale (50M events/day). Assume BigQuery or Snowflake.
Sub-questions:
- Schema: Propose fact and dimension tables (e.g., fact_events, fact_orders, dim_user, dim_country). Provide DDL-level details: partitioning (by event_date), clustering/sorting keys (user_id, event_type), and surrogate keys. Explain how you’d model slowly changing dimensions (Type 2) for user country and app version (see the DDL sketch after this list).
- Idempotency & deduplication: Events can arrive late (up to 14 days) and out of order, with occasional duplicates (same (user_id, event_ts, event_type)). Specify your dedupe key and merge/upsert strategy. How do you reprocess late data without double-counting cohorts? Include a backfill plan (see the MERGE sketch after this list).
- Metrics tables: Define the grain of derived tables for weekly retention by signup_week and ARPU by cohort. Show the SQL pattern (window PARTITION BY and date bucketing) you’d use and how materialization/incremental builds work. Describe data quality checks (e.g., cohort_size non-increasing across week_index, retention ∈ [0,1]); see the retention sketch after this list.
- Performance: Estimate table sizes, choose file sizes/micro-partitions, and justify cluster keys for typical queries (top-N countries over the last 8 weeks, rolling DAU/WAU/MAU). Discuss cost controls (partition pruning, approximate distinct with HLL, result caching); see the pruning/HLL sketch after this list.
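
For the Schema bullet, a minimal BigQuery-flavored DDL sketch. Dataset, table, and column names (analytics.fact_events, analytics.dim_user, revenue_usd, payload) are illustrative assumptions, not a spec; the same shape maps to Snowflake with automatic micro-partitioning plus a CLUSTER BY key.

```sql
-- Daily-partitioned event fact; columns are assumptions for illustration.
CREATE TABLE IF NOT EXISTS analytics.fact_events (
  event_id     STRING    NOT NULL,  -- surrogate key: hash of the dedupe key (user_id, event_ts, event_type)
  user_id      STRING    NOT NULL,  -- natural key, kept for clustering and joins
  user_sk      INT64,               -- surrogate key of the dim_user version valid at event_ts
  event_ts     TIMESTAMP NOT NULL,
  event_date   DATE      NOT NULL,  -- DATE(event_ts); partition column
  event_type   STRING    NOT NULL,
  revenue_usd  NUMERIC,             -- populated for purchase events
  payload      JSON
)
PARTITION BY event_date             -- daily partitions: pruning and cheap 14-day backfills
CLUSTER BY user_id, event_type;     -- matches the most common filter/join columns

-- Type 2 slowly changing dimension for user attributes (country, app version).
CREATE TABLE IF NOT EXISTS analytics.dim_user (
  user_sk      INT64     NOT NULL,  -- surrogate key, one row per attribute version
  user_id      STRING    NOT NULL,  -- natural/business key
  country_code STRING,
  app_version  STRING,
  valid_from   TIMESTAMP NOT NULL,  -- version start
  valid_to     TIMESTAMP,           -- NULL for the current version
  is_current   BOOL      NOT NULL
)
CLUSTER BY user_id;                 -- small table; clustering on the join key is enough
```

Facts join to the dimension with a point-in-time condition (event_ts between valid_from and valid_to), so a user who changes country or app version keeps their history attributed correctly.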
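
For the idempotency bullet, a hedged MERGE sketch in BigQuery syntax. The staging table staging.events_batch and its ingested_at column are assumptions; the dedupe key is the natural key from the bullet, hashed into event_id. Restricting the ON clause to the 14-day late-arrival window keeps the merge from scanning the whole fact table.

```sql
-- Deduplicate within the batch, then insert only rows not already present.
MERGE analytics.fact_events AS t
USING (
  SELECT * EXCEPT (rn)
  FROM (
    SELECT
      TO_HEX(SHA256(CONCAT(
        user_id, '|', CAST(event_ts AS STRING), '|', event_type))) AS event_id,  -- dedupe key -> surrogate key
      user_id,
      event_ts,
      DATE(event_ts) AS event_date,
      event_type,
      revenue_usd,
      ROW_NUMBER() OVER (
        PARTITION BY user_id, event_ts, event_type
        ORDER BY ingested_at DESC) AS rn          -- keep the newest copy of in-batch duplicates
    FROM staging.events_batch                     -- assumed landing table
  )
  WHERE rn = 1
) AS s
ON  t.event_id = s.event_id
AND t.event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 14 DAY)   -- prune the target scan to the late window
WHEN NOT MATCHED THEN
  INSERT (event_id, user_id, event_ts, event_date, event_type, revenue_usd)
  VALUES (s.event_id, s.user_id, s.event_ts, s.event_date, s.event_type, s.revenue_usd);
```

The window filter relies on the stated 14-day late-arrival bound; a backfill beyond it re-runs the same statement with an explicit date range on the target, then rebuilds the downstream cohort tables for the signup weeks those dates touch, so cohorts are recomputed rather than incremented twice.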
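
For the metrics bullet, a weekly-retention sketch at (signup_week, week_index) grain, BigQuery syntax. Signup is approximated as the user's first observed event, which is an assumption; an incremental build would MERGE only the signup weeks that can still change (those inside the 14-day window).

```sql
-- Grain of the derived table: one row per (signup_week, week_index).
WITH cohorts AS (                        -- first-seen week per user (proxy for signup)
  SELECT
    user_id,
    DATE_TRUNC(MIN(event_date), WEEK(MONDAY)) AS signup_week
  FROM analytics.fact_events
  GROUP BY user_id
),
activity AS (                            -- weekly activity buckets
  SELECT DISTINCT
    user_id,
    DATE_TRUNC(event_date, WEEK(MONDAY)) AS activity_week
  FROM analytics.fact_events
),
retained AS (
  SELECT
    c.signup_week,
    DATE_DIFF(a.activity_week, c.signup_week, WEEK) AS week_index,
    COUNT(DISTINCT a.user_id) AS retained_users
  FROM cohorts c
  JOIN activity a USING (user_id)
  GROUP BY 1, 2
)
SELECT
  signup_week,
  week_index,
  retained_users,
  FIRST_VALUE(retained_users) OVER (
    PARTITION BY signup_week ORDER BY week_index) AS cohort_size,    -- week 0 row = full cohort
  SAFE_DIVIDE(
    retained_users,
    FIRST_VALUE(retained_users) OVER (PARTITION BY signup_week ORDER BY week_index)
  ) AS retention                                                      -- should land in [0, 1]
FROM retained;
```

ARPU by cohort follows the same pattern: SUM(revenue_usd) grouped by (signup_week, week_index), divided by cohort_size. The quality checks from the bullet run against this output (retention within [0, 1], cohort_size consistent within each signup_week).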
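
For the performance bullet, two hedged cost-control sketches in BigQuery syntax (analytics.daily_user_sketch is an assumed rollup table). The first answers "top-N countries over the last 8 weeks" with partition pruning plus APPROX_COUNT_DISTINCT; the second pre-aggregates daily HLL sketches so rolling DAU/WAU/MAU merge sketches instead of rescanning raw events.

```sql
-- Top-10 countries by approximate distinct users over the last 8 weeks.
SELECT
  d.country_code,
  APPROX_COUNT_DISTINCT(f.user_id) AS approx_users               -- HLL-based, far cheaper than exact COUNT(DISTINCT)
FROM analytics.fact_events AS f
JOIN analytics.dim_user AS d
  ON  d.user_id = f.user_id
  AND f.event_ts >= d.valid_from                                  -- SCD2 point-in-time join:
  AND f.event_ts <  COALESCE(d.valid_to, TIMESTAMP '9999-12-31')  -- pick the version valid at event time
WHERE f.event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 8 WEEK)   -- partition pruning: ~56 daily partitions scanned
GROUP BY d.country_code
ORDER BY approx_users DESC
LIMIT 10;

-- Daily HLL sketches, merged at query time for rolling WAU (same idea for DAU/MAU).
CREATE TABLE IF NOT EXISTS analytics.daily_user_sketch
PARTITION BY event_date AS
SELECT event_date, HLL_COUNT.INIT(user_id) AS user_sketch
FROM analytics.fact_events
GROUP BY event_date;

SELECT HLL_COUNT.MERGE(user_sketch) AS wau
FROM analytics.daily_user_sketch
WHERE event_date > DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY);
```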