Design a reproducible data pipeline for modeling
Company: Capital One
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Technical Screen
You receive raw clickstream events and a user table. Build a reproducible daily pipeline that outputs user-day features for modeling. It must be idempotent, handle late-arriving/out-of-order events, include quality checks and unit tests, and support backfills.
Schema and small samples:
users
user_id | signup_date | country
u1 | 2025-05-20 | US
u2 | 2025-06-02 | CA
u3 | 2025-06-15 | US
events
event_id | user_id | ts | event_type | amount_usd | source
e1 | u1 | 2025-06-01T12:00:00Z | page_view | null | ads
e2 | u1 | 2025-06-01T12:10:00Z | add_to_cart | null | ads
e3 | u1 | 2025-06-01T12:45:00Z | purchase | 30.0 | ads
e4 | u2 | 2025-06-02T00:05:00Z | page_view | null | seo
e5 | u2 | 2025-06-02T00:40:00Z | purchase | 20.0 | seo
Tasks:
1) In SQL or Python, define transformations to produce user-day features: sessions (30-min inactivity gap), total_page_views, add_to_cart_count, purchases_count, revenue_usd, days_since_signup, country, and top_source for the day; dedupe by event_id; ensure UTC→date bucketing is correct.
2) Describe how you guarantee idempotency and correctness with late/out-of-order data (e.g., watermarking, upserts/merge, partition overwrite vs. append-only with versioning).
3) Specify a backfill plan for 2025-06-01 to 2025-08-31, including how you would re-run only affected partitions safely.
4) Propose concrete data-quality checks (row-count reconciliations, not-null/valid-set for event_type, nonnegative revenue) and two unit tests that would have caught common bugs.
5) Outline orchestration (DAG tasks and dependencies), storage formats/partitioning, and how you would expose the output for both training and online inference.
Quick Answer: This question evaluates a candidate's ability to design reproducible, idempotent data pipelines and feature engineering workflows using SQL and Python, including handling late or out-of-order events, deduplication, backfills, data-quality checks, and orchestration.