PracHub

Design a leak-free time-split model

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a Data Scientist's competency in time-aware predictive modeling: preventing label leakage from late-arriving events, temporal cross-validation, feature engineering, model selection (logistic regression vs gradient-boosted trees), and calibration and thresholding for costed actions.


Company: Stripe

Role: Data Scientist

Category: Machine Learning

Difficulty: hard

Interview Round: Technical Screen

Related Interview Questions

  • Normalize targets for multitask regression - Stripe (medium)
  • Design a hierarchical forecast for transactions - Stripe (medium)
  • Design a model for subscription adoption prediction - Stripe (hard)
  • Design a target‑user prediction system - Stripe (hard)
Oct 13, 2025, 9:49 PM

Predict 30-Day Purchase Probability at a Snapshot (Technical Screen)

Assume you have user, event, and order data with two timestamps per row:

  • event_time: when the user action actually happened (UTC).
  • arrival_time: when the event was recorded/ingested in analytics (UTC). Some events arrive late, i.e., arrival_time > event_time.
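The two timestamps above can gate what a feature pipeline is allowed to see. The following is a minimal sketch, assuming events are plain records with `event_time` and `arrival_time` fields; the helper name `visible_at` and the sample rows are hypothetical. The rule shown (an event must have both happened and been ingested before the snapshot) is one defensible convention, not the only one.

```python
from datetime import datetime, timezone

UTC = timezone.utc
SNAPSHOT_TS = datetime(2025, 9, 1, tzinfo=UTC)


def visible_at(events, snapshot_ts):
    """Keep events that had both occurred and been ingested before snapshot_ts.

    Hypothetical helper: filtering on event_time alone would let a late
    arrival into the features even though it was not yet known at scoring time.
    """
    return [
        e for e in events
        if e["event_time"] < snapshot_ts and e["arrival_time"] < snapshot_ts
    ]


# Illustrative rows only; not real data.
events = [
    # On time: happened and arrived before the snapshot.
    {"user_id": 1,
     "event_time": datetime(2025, 8, 30, tzinfo=UTC),
     "arrival_time": datetime(2025, 8, 30, 0, 5, tzinfo=UTC)},
    # Late arrival: happened before the snapshot, ingested after it.
    {"user_id": 2,
     "event_time": datetime(2025, 8, 31, 23, 0, tzinfo=UTC),
     "arrival_time": datetime(2025, 9, 2, tzinfo=UTC)},
    # Future event: happened after the snapshot.
    {"user_id": 3,
     "event_time": datetime(2025, 9, 3, tzinfo=UTC),
     "arrival_time": datetime(2025, 9, 3, 0, 1, tzinfo=UTC)},
]

feature_events = visible_at(events, SNAPSHOT_TS)
visible_user_ids = [e["user_id"] for e in feature_events]
```

Only the first row survives the filter. The second row is the interesting case: a backfilled training table would contain it, but a live scoring job running at the snapshot would not, so including it in features makes offline results look better than production.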

Assume an "active user" is any user with at least one event in the 90 days before the snapshot (you may state and use a different, defensible definition). You will deliver slides and runnable code within one week (about 4–6 hours of effort) that score each active user as of the snapshot and predict the probability that the user places an order in the next 30 days.
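The population and label definitions can be sketched in a few lines, assuming events and orders are plain records keyed by `user_id`. The function names, the choice to require `arrival_time` before the snapshot for the activity check, and the sample rows are all assumptions made for illustration. Labels use `event_time` only, on the assumption that they are built after the label window has closed and late arrivals have had time to land.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
SNAPSHOT_TS = datetime(2025, 9, 1, tzinfo=UTC)
ACTIVE_LOOKBACK = timedelta(days=90)
LABEL_HORIZON = timedelta(days=30)


def active_users(events, snapshot_ts):
    """Users with an event in [snapshot_ts - 90d, snapshot_ts) that was
    already ingested at the snapshot (the arrival check is an assumption)."""
    start = snapshot_ts - ACTIVE_LOOKBACK
    return {
        e["user_id"] for e in events
        if start <= e["event_time"] < snapshot_ts
        and e["arrival_time"] < snapshot_ts
    }


def build_labels(users, orders, snapshot_ts):
    """Label 1 if the user ordered in the half-open window
    [snapshot_ts, snapshot_ts + 30d), else 0."""
    end = snapshot_ts + LABEL_HORIZON
    buyers = {
        o["user_id"] for o in orders
        if snapshot_ts <= o["event_time"] < end
    }
    return {u: int(u in buyers) for u in sorted(users)}


def _row(user_id, ts):
    # Illustrative record where the event arrives one minute after it happens.
    return {"user_id": user_id, "event_time": ts,
            "arrival_time": ts + timedelta(minutes=1)}


events = [
    _row(1, datetime(2025, 8, 15, tzinfo=UTC)),  # inside the 90-day lookback
    _row(2, datetime(2025, 5, 1, tzinfo=UTC)),   # too old to count as active
    _row(3, datetime(2025, 7, 1, tzinfo=UTC)),   # inside the 90-day lookback
]
orders = [
    _row(1, datetime(2025, 9, 10, tzinfo=UTC)),  # inside the label window
    _row(3, datetime(2025, 10, 1, tzinfo=UTC)),  # exactly at the open end
]

users = active_users(events, SNAPSHOT_TS)
labels = build_labels(users, orders, SNAPSHOT_TS)
```

The order placed at exactly `snapshot_ts + 30 days` is excluded because the window is half-open, which keeps consecutive 30-day windows from double-counting a boundary order.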

Requirements

  1. Snapshot and label definition
    • Use snapshot_ts = 2025-09-01 00:00:00 UTC.
    • Predict whether an order occurs in the interval [snapshot_ts, snapshot_ts + 30 days).
  2. Leakage prevention
    • Specify exactly how you will prevent label leakage, including:
      • Late-arriving events (arrival_time > event_time).
      • Any features computed after snapshot_ts.
  3. Modeling approach
    • Propose a minimal but strong baseline and a main model.
    • Justify choosing between logistic regression with monotonic/regularized features versus gradient-boosted trees.
    • List the top 5 engineered features you would start with and explain why.
  4. Metrics
    • Choose the primary optimization metric under class imbalance (e.g., PR-AUC vs ROC-AUC) and the business metric you will report on slides.
    • Explain the trade-offs.
  5. Evaluation protocol
    • Describe a temporal cross-validation scheme (rolling-origin/blocked) that yields an honest estimate.
    • Explain how you will tune hyperparameters quickly without overfitting given the time budget.
  6. Leakage and contamination checks
    • Explain how you will detect and mitigate data/label leakage, target leakage, and train–test contamination.
    • Include at least two concrete checks you would code.
  7. Calibration and thresholding for email targeting
    • Outline your calibration plan (e.g., Platt scaling vs isotonic).
    • Propose threshold selection for an email use case with a cost per send.
    • Explain how you’ll communicate calibration quality and expected impact in slides.
  8. Time-boxed ablations and slides
    • Include a minimal ablation plan you would run within the timebox, including which features/modeling choices you would drop first if time runs short.
    • Provide the exact headlines of 5–7 slides that tell a compelling story to a hiring manager.
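For the evaluation protocol and contamination checks in requirements 5 and 6, one possible shape is a rolling-origin scheme over historical snapshots spaced one label horizon apart. The sketch below is an assumption about how that could be laid out, not a prescribed solution; `rolling_origin_folds` and `assert_no_label_overlap` are hypothetical names, and the 30-day spacing is chosen so that each training label window closes no later than the validation snapshot.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
LABEL_HORIZON = timedelta(days=30)


def rolling_origin_folds(final_snapshot, n_folds, step=LABEL_HORIZON):
    """Return (train_snapshots, valid_snapshot) pairs, oldest fold first.

    Snapshots are spaced `step` apart and end at `final_snapshot`; each fold
    trains on every earlier snapshot and validates on the next one.
    """
    snapshots = [final_snapshot - step * k for k in range(n_folds, -1, -1)]
    return [(snapshots[:i], snapshots[i]) for i in range(1, len(snapshots))]


def assert_no_label_overlap(train_snapshots, valid_snapshot):
    """Concrete check: a training label window must not extend past the
    validation snapshot, otherwise training labels peek into the period
    being validated."""
    for ts in train_snapshots:
        if ts + LABEL_HORIZON > valid_snapshot:
            raise ValueError(
                f"label window for {ts.isoformat()} overlaps "
                f"validation snapshot {valid_snapshot.isoformat()}"
            )


folds = rolling_origin_folds(datetime(2025, 9, 1, tzinfo=UTC), n_folds=3)
for train_snapshots, valid_snapshot in folds:
    assert_no_label_overlap(train_snapshots, valid_snapshot)
```

With three folds the validation snapshots land 60, 30, and 0 days before 2025-09-01, and the training set grows by one snapshot per fold. Spacing snapshots closer than the label horizon would make the check raise, which is the point of coding it rather than relying on convention.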

