You receive raw clickstream events and a user table. Build a reproducible daily pipeline that outputs user-day features for modeling. It must be idempotent, handle late-arriving/out-of-order events, include quality checks and unit tests, and support backfills.
Schema and small samples:
users

| user_id | signup_date | country |
| --- | --- | --- |
| u1 | 2025-05-20 | US |
| u2 | 2025-06-02 | CA |
| u3 | 2025-06-15 | US |

events

| event_id | user_id | ts | event_type | amount_usd | source |
| --- | --- | --- | --- | --- | --- |
| e1 | u1 | 2025-06-01T12:00:00Z | page_view | null | ads |
| e2 | u1 | 2025-06-01T12:10:00Z | add_to_cart | null | ads |
| e3 | u1 | 2025-06-01T12:45:00Z | purchase | 30.0 | ads |
| e4 | u2 | 2025-06-02T00:05:00Z | page_view | null | seo |
| e5 | u2 | 2025-06-02T00:40:00Z | purchase | 20.0 | seo |
Tasks:
- In SQL or Python, define the transformations that produce user-day features: sessions (30-minute inactivity gap), total_page_views, add_to_cart_count, purchases_count, revenue_usd, days_since_signup, country, and top_source for the day. Deduplicate by event_id and make sure UTC timestamps bucket to the correct calendar date (see the feature-transform sketch after this list).
- Describe how you guarantee idempotency and correctness with late/out-of-order data, e.g., watermarking, upserts/MERGE, and partition overwrite versus append-only with versioning (see the partition-overwrite sketch below).
- Specify a backfill plan for 2025-06-01 through 2025-08-31, including how you would safely re-run only the affected partitions (see the backfill sketch below).
- Propose concrete data-quality checks (row-count reconciliations, not-null/valid-set checks on event_type, non-negative revenue) and two unit tests that would have caught common bugs (see the checks and tests below).
- Outline orchestration (DAG tasks and dependencies), storage formats and partitioning, and how you would expose the output for both training and online inference (see the DAG sketch below).
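A minimal pandas sketch of the feature transform: dedupe by event_id, UTC-to-date bucketing, 30-minute sessionization, and the daily aggregates. The function name and the choice of pandas over SQL are illustrative, not a required interface.

```python
import pandas as pd

SESSION_GAP = pd.Timedelta(minutes=30)

def build_user_day_features(events: pd.DataFrame, users: pd.DataFrame) -> pd.DataFrame:
    ev = events.drop_duplicates(subset="event_id").copy()   # dedupe replayed events by event_id
    ev["ts"] = pd.to_datetime(ev["ts"], utc=True)            # parse timestamps as UTC
    ev["event_date"] = ev["ts"].dt.date                      # UTC -> calendar-date bucketing

    # Sessionize: a new session starts after a 30-minute inactivity gap per user.
    ev = ev.sort_values(["user_id", "ts"])
    new_session = ev.groupby("user_id")["ts"].diff() > SESSION_GAP
    ev["session_id"] = new_session.groupby(ev["user_id"]).cumsum()

    feats = ev.groupby(["user_id", "event_date"]).agg(
        sessions=("session_id", "nunique"),
        total_page_views=("event_type", lambda s: int((s == "page_view").sum())),
        add_to_cart_count=("event_type", lambda s: int((s == "add_to_cart").sum())),
        purchases_count=("event_type", lambda s: int((s == "purchase").sum())),
        # to_numeric tolerates literal "null" strings as well as proper NaNs
        revenue_usd=("amount_usd", lambda s: float(pd.to_numeric(s, errors="coerce").fillna(0).sum())),
        top_source=("source", lambda s: s.mode().iat[0]),     # most frequent source that day
    ).reset_index()

    out = feats.merge(users, on="user_id", how="left")        # attach country and signup_date
    out["days_since_signup"] = (
        pd.to_datetime(out["event_date"]) - pd.to_datetime(out["signup_date"])
    ).dt.days
    return out.drop(columns=["signup_date"])
```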
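One way to make the daily write idempotent while tolerating late events: each run recomputes a trailing window of day partitions from the raw source and atomically overwrites them. The table name, the 3-day lateness window, and the sqlite3-style connection are assumptions for the sketch.

```python
from datetime import date, timedelta

LATENESS_WINDOW = timedelta(days=3)   # assumed maximum lateness for arriving events

def partitions_to_rebuild(run_date: date) -> list[date]:
    # Late/out-of-order events can land up to LATENESS_WINDOW behind the run date,
    # so each daily run rebuilds the trailing window of partitions, not just "yesterday".
    return [run_date - timedelta(days=i) for i in range(LATENESS_WINDOW.days + 1)]

def write_partition(conn, features_df, event_date: date) -> None:
    # Delete-then-insert inside one transaction acts as an overwrite of a single day
    # partition; re-running with the same input leaves the table unchanged (idempotent).
    with conn:  # assumes a sqlite3-style DB-API connection used as a transaction context
        conn.execute(
            "DELETE FROM user_day_features WHERE event_date = ?",
            (event_date.isoformat(),),
        )
        features_df.to_sql("user_day_features", conn, if_exists="append", index=False)
```

An append-only table with a load version column plus a "latest version wins" view is a workable alternative when the store has no cheap partition overwrite.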
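A backfill for 2025-06-01 through 2025-08-31 can then reuse the same partition-level write, one day at a time, so it is safe to stop, resume, or repeat. `compute_user_day_features_for` is a hypothetical reader/transform for a single UTC day, not an existing function.

```python
from datetime import date, timedelta

def backfill(conn, start: date = date(2025, 6, 1), end: date = date(2025, 8, 31)) -> None:
    # Walk the range one day partition at a time, reusing the idempotent write above,
    # so only the affected partitions are touched and reruns converge to the same state.
    day = start
    while day <= end:
        features_df = compute_user_day_features_for(day)  # hypothetical: raw events for one UTC day
        write_partition(conn, features_df, day)            # partition overwrite from the sketch above
        day += timedelta(days=1)
```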
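Possible quality checks and two pytest-style unit tests against the transform from the first sketch; the valid-set constant and assertions are examples, not a prescribed suite. The first test targets UTC-midnight date bucketing, the second targets double counting of replayed events.

```python
import pandas as pd

# build_user_day_features: the transform from the first sketch (import path is project-specific).
VALID_EVENT_TYPES = {"page_view", "add_to_cart", "purchase"}

def run_quality_checks(events: pd.DataFrame, features: pd.DataFrame) -> None:
    assert events["event_id"].is_unique, "duplicate event_id in input"
    assert events["event_type"].notna().all(), "null event_type"
    assert events["event_type"].isin(VALID_EVENT_TYPES).all(), "unexpected event_type value"
    assert (features["revenue_usd"] >= 0).all(), "negative revenue"
    # Row-count reconciliation: purchases in the raw events must equal purchases counted out.
    assert int(features["purchases_count"].sum()) == int((events["event_type"] == "purchase").sum())

def test_utc_midnight_bucketing():
    # Would have caught bucketing by local time: 00:05Z must land on 2025-06-02, not 06-01.
    events = pd.DataFrame([{"event_id": "e4", "user_id": "u2", "ts": "2025-06-02T00:05:00Z",
                            "event_type": "page_view", "amount_usd": None, "source": "seo"}])
    users = pd.DataFrame([{"user_id": "u2", "signup_date": "2025-06-02", "country": "CA"}])
    out = build_user_day_features(events, users)
    assert str(out.loc[0, "event_date"]) == "2025-06-02"

def test_dedupe_does_not_double_count_revenue():
    # Would have caught summing a replayed purchase event twice.
    row = {"event_id": "e3", "user_id": "u1", "ts": "2025-06-01T12:45:00Z",
           "event_type": "purchase", "amount_usd": 30.0, "source": "ads"}
    events = pd.DataFrame([row, row])   # the same event delivered twice
    users = pd.DataFrame([{"user_id": "u1", "signup_date": "2025-05-20", "country": "US"}])
    out = build_user_day_features(events, users)
    assert out.loc[0, "revenue_usd"] == 30.0
```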
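An orchestration sketch in Airflow style (assuming Airflow 2.4+ for the `schedule` argument; task IDs, schedule, and the placeholder callables are illustrative). The offline output could be Parquet partitioned by event_date for training reads, with only the latest partition pushed to a low-latency key-value/feature store for online inference.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="user_day_features",
    start_date=datetime(2025, 6, 1),
    schedule="0 2 * * *",   # daily, after the prior UTC day's raw events have landed
    catchup=True,           # lets the scheduler re-run historical partitions for backfills
) as dag:
    # Callables are placeholders; each task would invoke the corresponding function above.
    ingest = PythonOperator(task_id="ingest_raw_events", python_callable=lambda: None)
    check_raw = PythonOperator(task_id="quality_checks_raw", python_callable=lambda: None)
    transform = PythonOperator(task_id="build_user_day_features", python_callable=lambda: None)
    check_out = PythonOperator(task_id="quality_checks_output", python_callable=lambda: None)
    # Offline path: write a Parquet partition keyed by event_date for model training.
    publish_offline = PythonOperator(task_id="publish_parquet_partition", python_callable=lambda: None)
    # Online path: push the latest partition to a feature store for low-latency inference.
    publish_online = PythonOperator(task_id="push_latest_to_feature_store", python_callable=lambda: None)

    ingest >> check_raw >> transform >> check_out >> [publish_offline, publish_online]
```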