You receive raw clickstream events and a user table. Build a reproducible daily pipeline that outputs user-day features for modeling. It must be idempotent, handle late-arriving/out-of-order events, include quality checks and unit tests, and support backfills.
Schema and small samples:
users

| user_id | signup_date | country |
| --- | --- | --- |
| u1 | 2025-05-20 | US |
| u2 | 2025-06-02 | CA |
| u3 | 2025-06-15 | US |

events

| event_id | user_id | ts | event_type | amount_usd | source |
| --- | --- | --- | --- | --- | --- |
| e1 | u1 | 2025-06-01T12:00:00Z | page_view | null | ads |
| e2 | u1 | 2025-06-01T12:10:00Z | add_to_cart | null | ads |
| e3 | u1 | 2025-06-01T12:45:00Z | purchase | 30.0 | ads |
| e4 | u2 | 2025-06-02T00:05:00Z | page_view | null | seo |
| e5 | u2 | 2025-06-02T00:40:00Z | purchase | 20.0 | seo |
Tasks:
- In SQL or Python, define the transformations that produce user-day features: sessions (30-minute inactivity gap), total_page_views, add_to_cart_count, purchases_count, revenue_usd, days_since_signup, country, and top_source for the day. Deduplicate by event_id and make sure UTC timestamps bucket to the correct calendar date (see the feature-transform sketch after this list).
- Describe how you guarantee idempotency and correctness with late/out-of-order data, e.g., watermarking, upserts/MERGE, and partition overwrite versus append-only with versioning (see the partition-overwrite sketch below).
- Specify a backfill plan for 2025-06-01 through 2025-08-31, including how you would safely re-run only the affected partitions (see the backfill sketch below).
- Propose concrete data-quality checks (row-count reconciliations, not-null/valid-set checks on event_type, non-negative revenue) and two unit tests that would have caught common bugs (see the checks and tests below).
- Outline orchestration (DAG tasks and dependencies), storage formats and partitioning, and how you would expose the output for both training and online inference (see the DAG sketch below).
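A minimal pandas sketch of the feature transform: dedupe by event_id, UTC-to-date bucketing, 30-minute sessionization, and the daily aggregates. The function name and the choice of pandas over SQL are illustrative, not a required interface.

```python
import pandas as pd

SESSION_GAP = pd.Timedelta(minutes=30)

def build_user_day_features(events: pd.DataFrame, users: pd.DataFrame) -> pd.DataFrame:
    ev = events.drop_duplicates(subset="event_id").copy()   # dedupe replayed events by event_id
    ev["ts"] = pd.to_datetime(ev["ts"], utc=True)            # parse timestamps as UTC
    ev["event_date"] = ev["ts"].dt.date                      # UTC -> calendar-date bucketing

    # Sessionize: a new session starts after a 30-minute inactivity gap per user.
    ev = ev.sort_values(["user_id", "ts"])
    new_session = ev.groupby("user_id")["ts"].diff() > SESSION_GAP
    ev["session_id"] = new_session.groupby(ev["user_id"]).cumsum()

    feats = ev.groupby(["user_id", "event_date"]).agg(
        sessions=("session_id", "nunique"),
        total_page_views=("event_type", lambda s: int((s == "page_view").sum())),
        add_to_cart_count=("event_type", lambda s: int((s == "add_to_cart").sum())),
        purchases_count=("event_type", lambda s: int((s == "purchase").sum())),
        # to_numeric tolerates literal "null" strings as well as proper NaNs
        revenue_usd=("amount_usd", lambda s: float(pd.to_numeric(s, errors="coerce").fillna(0).sum())),
        top_source=("source", lambda s: s.mode().iat[0]),     # most frequent source that day
    ).reset_index()

    out = feats.merge(users, on="user_id", how="left")        # attach country and signup_date
    out["days_since_signup"] = (
        pd.to_datetime(out["event_date"]) - pd.to_datetime(out["signup_date"])
    ).dt.days
    return out.drop(columns=["signup_date"])
```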
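One way to make the daily write idempotent while tolerating late events: each run recomputes a trailing window of day partitions from the raw source and atomically overwrites them. The table name, the 3-day lateness window, and the sqlite3-style connection are assumptions for the sketch.

```python
from datetime import date, timedelta

LATENESS_WINDOW = timedelta(days=3)   # assumed maximum lateness for arriving events

def partitions_to_rebuild(run_date: date) -> list[date]:
    # Late/out-of-order events can land up to LATENESS_WINDOW behind the run date,
    # so each daily run rebuilds the trailing window of partitions, not just "yesterday".
    return [run_date - timedelta(days=i) for i in range(LATENESS_WINDOW.days + 1)]

def write_partition(conn, features_df, event_date: date) -> None:
    # Delete-then-insert inside one transaction acts as an overwrite of a single day
    # partition; re-running with the same input leaves the table unchanged (idempotent).
    with conn:  # assumes a sqlite3-style DB-API connection used as a transaction context
        conn.execute(
            "DELETE FROM user_day_features WHERE event_date = ?",
            (event_date.isoformat(),),
        )
        features_df.to_sql("user_day_features", conn, if_exists="append", index=False)
```

An append-only table with a load version column plus a "latest version wins" view is a workable alternative when the store has no cheap partition overwrite.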
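A backfill for 2025-06-01 through 2025-08-31 can then reuse the same partition-level write, one day at a time, so it is safe to stop, resume, or repeat. `compute_user_day_features_for` is a hypothetical reader/transform for a single UTC day, not an existing function.

```python
from datetime import date, timedelta

def backfill(conn, start: date = date(2025, 6, 1), end: date = date(2025, 8, 31)) -> None:
    # Walk the range one day partition at a time, reusing the idempotent write above,
    # so only the affected partitions are touched and reruns converge to the same state.
    day = start
    while day <= end:
        features_df = compute_user_day_features_for(day)  # hypothetical: raw events for one UTC day
        write_partition(conn, features_df, day)            # partition overwrite from the sketch above
        day += timedelta(days=1)
```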
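Possible quality checks and two pytest-style unit tests against the transform from the first sketch; the valid-set constant and assertions are examples, not a prescribed suite. The first test targets UTC-midnight date bucketing, the second targets double counting of replayed events.

```python
import pandas as pd

# build_user_day_features: the transform from the first sketch (import path is project-specific).
VALID_EVENT_TYPES = {"page_view", "add_to_cart", "purchase"}

def run_quality_checks(events: pd.DataFrame, features: pd.DataFrame) -> None:
    assert events["event_id"].is_unique, "duplicate event_id in input"
    assert events["event_type"].notna().all(), "null event_type"
    assert events["event_type"].isin(VALID_EVENT_TYPES).all(), "unexpected event_type value"
    assert (features["revenue_usd"] >= 0).all(), "negative revenue"
    # Row-count reconciliation: purchases in the raw events must equal purchases counted out.
    assert int(features["purchases_count"].sum()) == int((events["event_type"] == "purchase").sum())

def test_utc_midnight_bucketing():
    # Would have caught bucketing by local time: 00:05Z must land on 2025-06-02, not 06-01.
    events = pd.DataFrame([{"event_id": "e4", "user_id": "u2", "ts": "2025-06-02T00:05:00Z",
                            "event_type": "page_view", "amount_usd": None, "source": "seo"}])
    users = pd.DataFrame([{"user_id": "u2", "signup_date": "2025-06-02", "country": "CA"}])
    out = build_user_day_features(events, users)
    assert str(out.loc[0, "event_date"]) == "2025-06-02"

def test_dedupe_does_not_double_count_revenue():
    # Would have caught summing a replayed purchase event twice.
    row = {"event_id": "e3", "user_id": "u1", "ts": "2025-06-01T12:45:00Z",
           "event_type": "purchase", "amount_usd": 30.0, "source": "ads"}
    events = pd.DataFrame([row, row])   # the same event delivered twice
    users = pd.DataFrame([{"user_id": "u1", "signup_date": "2025-05-20", "country": "US"}])
    out = build_user_day_features(events, users)
    assert out.loc[0, "revenue_usd"] == 30.0
```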
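An orchestration sketch in Airflow style (assuming Airflow 2.4+ for the `schedule` argument; task IDs, schedule, and the placeholder callables are illustrative). The offline output could be Parquet partitioned by event_date for training reads, with only the latest partition pushed to a low-latency key-value/feature store for online inference.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="user_day_features",
    start_date=datetime(2025, 6, 1),
    schedule="0 2 * * *",   # daily, after the prior UTC day's raw events have landed
    catchup=True,           # lets the scheduler re-run historical partitions for backfills
) as dag:
    # Callables are placeholders; each task would invoke the corresponding function above.
    ingest = PythonOperator(task_id="ingest_raw_events", python_callable=lambda: None)
    check_raw = PythonOperator(task_id="quality_checks_raw", python_callable=lambda: None)
    transform = PythonOperator(task_id="build_user_day_features", python_callable=lambda: None)
    check_out = PythonOperator(task_id="quality_checks_output", python_callable=lambda: None)
    # Offline path: write a Parquet partition keyed by event_date for model training.
    publish_offline = PythonOperator(task_id="publish_parquet_partition", python_callable=lambda: None)
    # Online path: push the latest partition to a feature store for low-latency inference.
    publish_online = PythonOperator(task_id="push_latest_to_feature_store", python_callable=lambda: None)

    ingest >> check_raw >> transform >> check_out >> [publish_offline, publish_online]
```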