PracHub
QuestionsPremiumLearningGuidesInterview PrepNEWCoaches

Quick Overview

This question evaluates a candidate's ability to design reproducible, idempotent data pipelines and feature engineering workflows using SQL and Python, including handling late or out-of-order events, deduplication, backfills, data-quality checks, and orchestration.

  • Medium
  • Capital One
  • Data Manipulation (SQL/Python)
  • Data Scientist

Design a reproducible data pipeline for modeling

Company: Capital One

Role: Data Scientist

Category: Data Manipulation (SQL/Python)

Difficulty: Medium

Interview Round: Technical Screen

You receive raw clickstream events and a user table. Build a reproducible daily pipeline that outputs user-day features for modeling. It must be idempotent, handle late-arriving/out-of-order events, include quality checks and unit tests, and support backfills. Schema and small samples: users user_id | signup_date | country u1 | 2025-05-20 | US u2 | 2025-06-02 | CA u3 | 2025-06-15 | US events event_id | user_id | ts | event_type | amount_usd | source e1 | u1 | 2025-06-01T12:00:00Z | page_view | null | ads e2 | u1 | 2025-06-01T12:10:00Z | add_to_cart | null | ads e3 | u1 | 2025-06-01T12:45:00Z | purchase | 30.0 | ads e4 | u2 | 2025-06-02T00:05:00Z | page_view | null | seo e5 | u2 | 2025-06-02T00:40:00Z | purchase | 20.0 | seo Tasks: 1) In SQL or Python, define transformations to produce user-day features: sessions (30-min inactivity gap), total_page_views, add_to_cart_count, purchases_count, revenue_usd, days_since_signup, country, and top_source for the day; dedupe by event_id; ensure UTC→date bucketing is correct. 2) Describe how you guarantee idempotency and correctness with late/out-of-order data (e.g., watermarking, upserts/merge, partition overwrite vs. append-only with versioning). 3) Specify a backfill plan for 2025-06-01 to 2025-08-31, including how you would re-run only affected partitions safely. 4) Propose concrete data-quality checks (row-count reconciliations, not-null/valid-set for event_type, nonnegative revenue) and two unit tests that would have caught common bugs. 5) Outline orchestration (DAG tasks and dependencies), storage formats/partitioning, and how you would expose the output for both training and online inference.

Quick Answer: This question evaluates a candidate's ability to design reproducible, idempotent data pipelines and feature engineering workflows using SQL and Python, including handling late or out-of-order events, deduplication, backfills, data-quality checks, and orchestration.

Last updated: Mar 29, 2026

Loading coding console...

PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Clean and Merge Housing Data - Capital One (easy)
  • Find Lowest Prices for Highly Rated Categories - Capital One (medium)
  • Write SQL to compute campaign net revenue - Capital One (Medium)
  • Merge CSVs and build revenue pivot with pandas - Capital One (Medium)
  • Find top category per region in Aug 2025 - Capital One (Medium)