Design an analytic warehouse for event data
Company: Snowflake
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Onsite
Design a warehouse-ready analytics data model and ingestion plan to support cohort retention, ARPU, and product usage analyses at scale (50M events/day). Assume BigQuery or Snowflake.
Sub-questions:
1) Schema: Propose fact and dimension tables (e.g., fact_events, fact_orders, dim_user, dim_country). Provide DDL-level details: partitioning (by event_date), clustering/sorting keys (user_id, event_type), and surrogate keys. Explain how you’d model slowly changing dimensions (Type 2) for user country and app version.
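A minimal DDL sketch of the shape an answer could take, in Snowflake-flavored SQL (table and column names are illustrative, not prescribed by the question; BigQuery equivalents noted in comments):

```sql
-- fact_events: one row per raw event, keyed for dedupe and pruning.
CREATE TABLE fact_events (
    event_sk    NUMBER IDENTITY,          -- surrogate key
    user_id     NUMBER NOT NULL,
    event_type  VARCHAR NOT NULL,
    event_ts    TIMESTAMP_NTZ NOT NULL,
    event_date  DATE NOT NULL,            -- derived from event_ts; prune on this
    payload     VARIANT
)
CLUSTER BY (event_date, user_id);
-- BigQuery analogue: PARTITION BY event_date CLUSTER BY user_id, event_type

-- dim_user: SCD Type 2 for country and app_version; facts join on user_sk
-- as of event_ts (event_ts BETWEEN valid_from AND COALESCE(valid_to, ...)).
CREATE TABLE dim_user (
    user_sk     NUMBER IDENTITY,          -- surrogate key
    user_id     NUMBER NOT NULL,          -- natural key
    country     VARCHAR,
    app_version VARCHAR,
    valid_from  TIMESTAMP_NTZ NOT NULL,   -- start of this version's validity
    valid_to    TIMESTAMP_NTZ,            -- NULL while current
    is_current  BOOLEAN DEFAULT TRUE
);
```

On a country or app-version change, the current row is closed (valid_to set, is_current = FALSE) and a new row is inserted, so historical facts keep their point-in-time attributes.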
2) Idempotency & deduplication: Events can arrive late (up to 14 days) and out-of-order with occasional duplicates (same (user_id, event_ts, event_type)). Specify your dedupe key and merge/upsert strategy. How do you reprocess late data without double-counting cohorts? Include a backfill plan.
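A runnable sketch of the dedupe-then-merge pattern, using stdlib sqlite3 to stand in for the warehouse (Snowflake/BigQuery would use MERGE; sqlite's ON CONFLICT upsert is the analogue, and all table names here are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE staging_events (user_id INT, event_ts TEXT, event_type TEXT, ingest_ts TEXT);
CREATE TABLE fact_events (user_id INT, event_ts TEXT, event_type TEXT, ingest_ts TEXT,
                          PRIMARY KEY (user_id, event_ts, event_type));
""")

# Staged batch contains an in-batch duplicate plus a late re-delivery of a
# row already present in the fact table.
con.executemany("INSERT INTO staging_events VALUES (?,?,?,?)", [
    (1, "2024-01-01T10:00:00", "click", "2024-01-01T10:01:00"),
    (1, "2024-01-01T10:00:00", "click", "2024-01-01T10:05:00"),  # duplicate
    (2, "2024-01-02T09:00:00", "view",  "2024-01-02T09:01:00"),  # re-delivery
])
con.execute("INSERT INTO fact_events VALUES (2, '2024-01-02T09:00:00', 'view', '2024-01-02T09:01:00')")

# Dedupe within the batch on (user_id, event_ts, event_type), keeping the
# latest ingest_ts, then upsert so re-delivered rows never double-count.
con.execute("""
INSERT INTO fact_events
SELECT user_id, event_ts, event_type, ingest_ts
FROM (
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY user_id, event_ts, event_type
        ORDER BY ingest_ts DESC) AS rn
    FROM staging_events
)
WHERE rn = 1
ON CONFLICT (user_id, event_ts, event_type) DO UPDATE SET ingest_ts = excluded.ingest_ts
""")

print(con.execute("SELECT COUNT(*) FROM fact_events").fetchone()[0])
```

Because the merge is keyed on the dedupe key, replaying any 14-day window of staging data (the backfill) is idempotent: late rows insert, re-delivered rows update in place, and downstream cohort counts stay stable.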
3) Metrics tables: Define a derived table grain for weekly retention by signup_week and ARPU by cohort. Show the SQL pattern (window PARTITION BY and date bucketing) you’d use and how materialization/incremental build works. Describe data quality checks (e.g., cohort_size non-increasing across week_index, retention ∈ [0,1]).
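A sketch of the retention table at the (signup_week, week_index) grain, again with stdlib sqlite3 and integer week numbers standing in for date_trunc bucketing (table names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE signups  (user_id INT, signup_week INT);
CREATE TABLE activity (user_id INT, activity_week INT);
""")
con.executemany("INSERT INTO signups VALUES (?,?)", [(1, 0), (2, 0), (3, 1)])
con.executemany("INSERT INTO activity VALUES (?,?)",
                [(1, 0), (2, 0), (1, 1), (3, 1), (3, 2)])

# Grain: one row per (signup_week, week_index). cohort_size is fixed per
# cohort, so retention is bounded in [0, 1] by construction.
rows = con.execute("""
SELECT s.signup_week,
       a.activity_week - s.signup_week            AS week_index,
       COUNT(DISTINCT a.user_id) * 1.0
         / MAX(c.cohort_size)                     AS retention
FROM signups s
JOIN activity a ON a.user_id = s.user_id
               AND a.activity_week >= s.signup_week
JOIN (SELECT signup_week, COUNT(*) AS cohort_size
      FROM signups GROUP BY signup_week) c
  ON c.signup_week = s.signup_week
GROUP BY s.signup_week, week_index
ORDER BY s.signup_week, week_index
""").fetchall()

for signup_week, week_index, retention in rows:
    print(signup_week, week_index, round(retention, 2))
```

An incremental build would recompute only the signup_week partitions touched by late-arriving activity; the data quality checks (retention in [0, 1], distinct active users per cohort never exceeding cohort_size) run as assertions over this derived table.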
4) Performance: Estimate table sizes, choose file sizes/micro-partitions, and justify cluster keys for typical queries (top-N countries last 8 weeks, rolling DAU/WAU/MAU). Discuss cost controls (partition pruning, approximate distinct with HLL, result caching).
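A back-of-envelope sizing an answer might walk through; the ~200 compressed bytes per event is an assumption, not given in the question:

```python
events_per_day = 50_000_000
bytes_per_event = 200                      # assumed compressed size on disk

daily_gb = events_per_day * bytes_per_event / 1e9   # 10 GB/day
yearly_tb = daily_gb * 365 / 1e3                    # ~3.7 TB/year

# A "top-N countries, last 8 weeks" query with partition pruning on
# event_date scans ~56 days of data rather than the whole table.
scanned_gb = daily_gb * 56

print(f"{daily_gb:.0f} GB/day, {yearly_tb:.2f} TB/year, "
      f"{scanned_gb:.0f} GB scanned for an 8-week query")
```

At these sizes the pruned scan is the dominant cost lever; approximate distinct counts (HLL) and result caching then shave the remaining compute on repeated DAU/WAU/MAU rollups.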
Quick Answer: This question evaluates a candidate's competency in data warehousing and engineering for analytics: event-level schema design and partitioning, slowly changing dimensions, idempotent ingestion and deduplication, derived metrics modeling, and query performance/cost trade-offs using SQL/Python.