Design a reproducible data pipeline for modeling

Q: What difficulty level is this coding question?

This is a medium-difficulty Data Manipulation (SQL/Python) question, commonly asked during Technical Screen rounds at Capital One.

Q: What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Capital One during technical interviews.

Question

You receive raw clickstream events and a user table. Build a reproducible daily pipeline that outputs user-day features for modeling. It must be idempotent, handle late-arriving/out-of-order events, include quality checks and unit tests, and support backfills.

Schema and small samples:
users
user_id | signup_date | country
u1 | 2025-05-20 | US
u2 | 2025-06-02 | CA
u3 | 2025-06-15 | US

events
event_id | user_id | ts | event_type | amount_usd | source
e1 | u1 | 2025-06-01T12:00:00Z | page_view | null | ads
e2 | u1 | 2025-06-01T12:10:00Z | add_to_cart | null | ads
e3 | u1 | 2025-06-01T12:45:00Z | purchase | 30.0 | ads
e4 | u2 | 2025-06-02T00:05:00Z | page_view | null | seo
e5 | u2 | 2025-06-02T00:40:00Z | purchase | 20.0 | seo

Tasks:
1) In SQL or Python, define transformations to produce user-day features: sessions (30-min inactivity gap), total_page_views, add_to_cart_count, purchases_count, revenue_usd, days_since_signup, country, and top_source for the day; dedupe by event_id; ensure UTC→date bucketing is correct.
2) Describe how you guarantee idempotency and correctness with late/out-of-order data (e.g., watermarking, upserts/merge, partition overwrite vs. append-only with versioning).
3) Specify a backfill plan for 2025-06-01 to 2025-08-31, including how you would re-run only affected partitions safely.
4) Propose concrete data-quality checks (row-count reconciliations, not-null/valid-set for event_type, nonnegative revenue) and two unit tests that would have caught common bugs.
5) Outline orchestration (DAG tasks and dependencies), storage formats/partitioning, and how you would expose the output for both training and online inference.
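
As a rough illustration of Task 1, the sketch below computes the user-day features in plain Python over the sample rows above. It is one possible reading of the requirements, not a reference solution. The function name `user_day_features`, the output key `event_date`, and the tie-break for `top_source` (highest count, then alphabetical) are assumptions. It also assumes that a gap strictly greater than 30 minutes starts a new session, that sessions are counted within each UTC day (so a session spanning midnight is counted on both days), and that every event's `user_id` exists in `users`.

```python
from collections import Counter, defaultdict
from datetime import date, datetime, timedelta, timezone

SESSION_GAP = timedelta(minutes=30)

USERS = {
    "u1": {"signup_date": "2025-05-20", "country": "US"},
    "u2": {"signup_date": "2025-06-02", "country": "CA"},
    "u3": {"signup_date": "2025-06-15", "country": "US"},
}
EVENTS = [
    {"event_id": "e1", "user_id": "u1", "ts": "2025-06-01T12:00:00Z", "event_type": "page_view", "amount_usd": None, "source": "ads"},
    {"event_id": "e2", "user_id": "u1", "ts": "2025-06-01T12:10:00Z", "event_type": "add_to_cart", "amount_usd": None, "source": "ads"},
    {"event_id": "e3", "user_id": "u1", "ts": "2025-06-01T12:45:00Z", "event_type": "purchase", "amount_usd": 30.0, "source": "ads"},
    {"event_id": "e4", "user_id": "u2", "ts": "2025-06-02T00:05:00Z", "event_type": "page_view", "amount_usd": None, "source": "seo"},
    {"event_id": "e5", "user_id": "u2", "ts": "2025-06-02T00:40:00Z", "event_type": "purchase", "amount_usd": 20.0, "source": "seo"},
]


def parse_ts(ts: str) -> datetime:
    # The samples use a trailing "Z"; convert to a timezone-aware UTC datetime.
    return datetime.fromisoformat(ts.replace("Z", "+00:00")).astimezone(timezone.utc)


def user_day_features(events, users):
    # Dedupe by event_id, keeping the first occurrence.
    unique = {}
    for e in events:
        unique.setdefault(e["event_id"], e)

    # Bucket events by (user_id, UTC calendar date).
    buckets = defaultdict(list)
    for e in unique.values():
        ts = parse_ts(e["ts"])
        buckets[(e["user_id"], ts.date())].append((ts, e))

    rows = []
    for (user_id, day), items in sorted(buckets.items()):
        # Sort so that out-of-order arrival does not change the result.
        items.sort(key=lambda pair: (pair[0], pair[1]["event_id"]))
        sessions, prev = 0, None
        for ts, _ in items:
            if prev is None or ts - prev > SESSION_GAP:
                sessions += 1
            prev = ts
        types = Counter(e["event_type"] for _, e in items)
        sources = Counter(e["source"] for _, e in items)
        user = users[user_id]  # assumes an inner join: every event has a user row
        rows.append({
            "user_id": user_id,
            "event_date": day.isoformat(),
            "sessions": sessions,
            "total_page_views": types["page_view"],
            "add_to_cart_count": types["add_to_cart"],
            "purchases_count": types["purchase"],
            "revenue_usd": sum(e["amount_usd"] or 0.0 for _, e in items
                               if e["event_type"] == "purchase"),
            "days_since_signup": (day - date.fromisoformat(user["signup_date"])).days,
            "country": user["country"],
            # Deterministic tie-break: highest count first, then alphabetical.
            "top_source": min(sources, key=lambda s: (-sources[s], s)),
        })
    return rows
```

On the sample data this yields one row for u1 on 2025-06-01 and one for u2 on 2025-06-02. Both users get two sessions, because each has a 35-minute gap before the purchase. Feeding the same event twice or in a different order leaves the output unchanged, which is the property the unit tests in Task 4 would want to pin down.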

PracHub · Accepted Answer

This question evaluates a candidate's ability to design reproducible, idempotent data pipelines and feature engineering workflows using SQL and Python, including handling late or out-of-order events, deduplication, backfills, data-quality checks, and orchestration.
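
To make the idempotency, late-data, backfill, and data-quality points more concrete, here is a minimal sketch that uses an in-memory dict as a stand-in for a date-partitioned table. It takes the partition-overwrite approach: each day is recomputed from all raw events seen so far and replaces the previous partition, so re-running a day is safe. All names (`build_partition`, `run`, `affected_days`, `backfill_days`, `check_quality`) are hypothetical, the valid `event_type` set is inferred from the sample rows, and the late event `e6` is invented purely for illustration.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta, timezone

# Assumed valid set, inferred from the sample events only.
VALID_EVENT_TYPES = {"page_view", "add_to_cart", "purchase"}


def event_date(e) -> str:
    ts = datetime.fromisoformat(e["ts"].replace("Z", "+00:00"))
    return ts.astimezone(timezone.utc).date().isoformat()


def check_quality(events):
    """Return a list of violations; an empty list means the batch passed."""
    problems = []
    for e in events:
        if e["event_type"] not in VALID_EVENT_TYPES:
            problems.append(f"{e['event_id']}: invalid event_type")
        if e["amount_usd"] is not None and e["amount_usd"] < 0:
            problems.append(f"{e['event_id']}: negative amount_usd")
    return problems


def build_partition(raw_events, day):
    """Recompute one day's rows; a pure function of (raw_events, day)."""
    # Dict keyed by event_id dedupes replayed events.
    unique = {e["event_id"]: e for e in raw_events if event_date(e) == day}
    agg = defaultdict(lambda: {"purchases_count": 0, "revenue_usd": 0.0})
    for e in unique.values():
        row = agg[e["user_id"]]
        if e["event_type"] == "purchase":
            row["purchases_count"] += 1
            row["revenue_usd"] += e["amount_usd"] or 0.0
    return dict(agg)


def run(store, raw_events, days):
    for day in sorted(days):
        store[day] = build_partition(raw_events, day)  # overwrite, never append


def affected_days(new_events):
    """Partitions touched by a batch of late or replayed events."""
    return {event_date(e) for e in new_events}


def backfill_days(start: date, end: date):
    """Inclusive list of ISO dates, e.g. for a full-range backfill."""
    return [(start + timedelta(days=i)).isoformat()
            for i in range((end - start).days + 1)]


raw = [
    {"event_id": "e3", "user_id": "u1", "ts": "2025-06-01T12:45:00Z", "event_type": "purchase", "amount_usd": 30.0, "source": "ads"},
    {"event_id": "e5", "user_id": "u2", "ts": "2025-06-02T00:40:00Z", "event_type": "purchase", "amount_usd": 20.0, "source": "seo"},
]
store = {}
run(store, raw, affected_days(raw))

# Hypothetical late event for 2025-06-01 that arrives after the first run.
late = [{"event_id": "e6", "user_id": "u1", "ts": "2025-06-01T23:59:00Z", "event_type": "purchase", "amount_usd": 5.0, "source": "ads"}]
raw = raw + late
run(store, raw, affected_days(late))  # re-runs only the 2025-06-01 partition
```

After the second run, only the 2025-06-01 partition has changed, and running it again with the same inputs produces the same partition. In a real warehouse the same idea would typically be expressed as a partition overwrite or a MERGE keyed on user and date. A full backfill for 2025-06-01 to 2025-08-31 would pass `backfill_days(date(2025, 6, 1), date(2025, 8, 31))` to `run`, ideally after `check_quality` returns no violations for the input batch.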

Quick Overview
