PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates practical skills in idempotent ETL design, deduplication by natural keys, handling late-arriving and out-of-order records, upsert/merge strategies, and SQL-based aggregation for daily metrics.

  • Medium
  • Stripe
  • Data Manipulation (SQL/Python)
  • Data Scientist

Design an idempotent SQL ETL for late data

Company: Stripe

Role: Data Scientist

Category: Data Manipulation (SQL/Python)

Difficulty: Medium

Interview Round: Technical Screen

You own the daily_user_metrics fact table. Build an idempotent, rerunnable ETL that can be triggered for any date D and correctly handles duplicates, late-arriving records (up to 2 days late), and status changes. You must write whiteboard-level SQL for the core transformation and describe the upsert strategy. Warehouse (UTC). Raw landing schemas and tiny samples: Table: users_dim u_id | created_at | country | is_test ---- | -------------------- | ------- | ------- 1 | 2025-08-28 10:00:00 | US | 0 2 | 2025-08-30 12:00:00 | CA | 0 3 | 2025-08-15 09:00:00 | US | 1 Table: events_raw event_id | u_id | event_type | event_ts | ingested_at | source -------- | ---- | ---------- | ------------------- | -------------------- | ------ e1 | 1 | view | 2025-09-01 02:03:00 | 2025-09-01 02:03:05 | web e1 | 1 | view | 2025-09-01 02:03:00 | 2025-09-02 01:00:00 | replay (duplicate, late) e2 | 1 | purchase | 2025-09-01 03:10:00 | 2025-09-01 03:10:04 | web e3 | 2 | view | 2025-08-31 23:59:59 | 2025-09-01 00:00:01 | web (late arrival for 08-31) Table: orders_raw order_id | u_id | amount | status | order_ts | updated_at | ingested_at -------- | ---- | ------ | -------- | ------------------- | ------------------- | ------------------- o1 | 1 | 100.00 | paid | 2025-09-01 03:10:00 | 2025-09-01 03:12:00 | 2025-09-01 03:12:05 o1 | 1 | 100.00 | refunded | 2025-09-01 03:10:00 | 2025-09-03 09:00:00 | 2025-09-03 09:00:10 (late status change) o2 | 2 | 50.00 | pending | 2025-09-01 20:00:00 | 2025-09-01 20:01:00 | 2025-09-01 20:01:05 Target: daily_user_metrics (dt DATE, u_id BIGINT, first_event_ts TIMESTAMP, events_cnt INT, paid_orders_cnt INT, paid_orders_amt DECIMAL(12,2)). Exclude users_dim.is_test = 1. Tasks (be precise and tricky): 1) Dedup staging logic. Write SQL CTE(s) that deduplicate events_raw by event_id keeping only the row with the max(ingested_at). Do the same for orders_raw by order_id, keeping the row with the max(updated_at) as the authoritative status snapshot. Explain why dedup by natural keys (event_id/order_id) is safer than relying on ROW_NUMBER over (u_id, event_ts) here. 2) Daily metric for D=2025-09-01. Using only SQL, produce one row per non-test user with: first_event_ts on D; events_cnt on D; paid_orders_cnt and paid_orders_amt on D where an order counts only if the latest status (per your dedup) is in ('paid','shipped','completed') and not in ('refunded','canceled'). Show the SELECT that computes these fields using your deduped CTEs. Ensure events are filtered by event_ts between [D 00:00:00, D 23:59:59.999] in UTC. 3) Late data capture window. Assume records for D can arrive up to 2 days late. Describe and write SQL for an incremental approach that, on run date R=D+0, D+1, and D+2, recomputes partitions for [D-2, D] and then MERGEs only the dt=D partition in daily_user_metrics so that late-arriving rows are included exactly once. Show an example MERGE (or INSERT OVERWRITE PARTITION) for dt=D and explain how it remains idempotent on reruns. 4) Guardrails and failure modes. Enumerate at least five edge cases your ETL must handle (e.g., partial writes/transactionality, schema drift adding a nullable column to events_raw, null event_ts vs non-null ingested_at, daylight saving changes if you later switch to country-local days, replayed backfills that resend old event_ids). For each, state the defensive technique (e.g., write-ahead staging + checksum, schema evolution policy, fallback to ingested_at for partition pruning but event_ts for business logic, surrogate partitioning strategy, MERGE with deterministic dedup). 5) Validation. Propose two reconciliation queries: one that compares counts and sums between orders_raw (latest status) and daily_user_metrics for D, and one that detects duplicate event_ids that still leaked into the D partition. Include expected results when using the sample data above.

Quick Answer: This question evaluates practical skills in idempotent ETL design, deduplication by natural keys, handling late-arriving and out-of-order records, upsert/merge strategies, and SQL-based aggregation for daily metrics.

Last updated: Mar 29, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Write SQL to monitor weekly chargeback spikes - Stripe (Medium)
  • Write SQL to detect recurring non-subscription users - Stripe (Medium)
  • Design metrics and write SQL for a case - Stripe (Medium)
  • Write SQL for snapshot features and labels - Stripe (Medium)
  • Design payment-to-invoice matcher with priorities - Stripe (Medium)