Design a leak-free time-split model

Q: Design a leak-free time-split model

This is a Machine Learning interview question from Stripe for Data Scientist roles. View the full question and solution on PracHub.

Q: How do I approach Machine Learning interview questions?

Machine Learning questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master machine learning interviews.

Question

Predict 30-Day Purchase Probability at a Snapshot (Technical Screen)

Assume you have user, event, and order data with two timestamps per row:

event_time: when the user action actually happened (UTC).
arrival_time: when the event was recorded/ingested in analytics (UTC). Some events arrive late, i.e., arrival_time > event_time.

Assume an "active user" is any user with at least one event in the 90 days prior to the snapshot (you may state and use a different defensible definition if you prefer). You will deliver slides and runnable code within one week (~4–6 hours) that scores each active user as of a snapshot and predicts their probability of placing an order in the next 30 days.

Requirements

Snapshot and label definition
- Use snapshot_ts = 2025-09-01 00:00:00 UTC.
- Predict whether an order occurs in the interval [snapshot_ts, snapshot_ts + 30 days).
Leakage prevention
- Specify exactly how you will prevent label leakage, including:
  - Late-arriving events (arrival_time > event_time).
  - Any features computed after snapshot_ts.
Modeling approach
- Propose a minimal but strong baseline and a main model.
- Justify choosing between logistic regression with monotonic/regularized features versus gradient-boosted trees.
- List the top 5 engineered features you would start with and explain why.
Metrics
- Choose the primary optimization metric under class imbalance (e.g., PR-AUC vs ROC-AUC) and the business metric you will report on slides.
- Explain the trade-offs.
Evaluation protocol
- Describe a temporal cross-validation scheme (rolling-origin/blocked) that yields an honest estimate.
- Explain how you will tune hyperparameters quickly without overfitting given the time budget.
Leakage and contamination checks
- Explain how you will detect and mitigate data/label leakage, target leakage, and train–test contamination.
- Include at least two concrete checks you would code.
Calibration and thresholding for email targeting
- Outline your calibration plan (e.g., Platt scaling vs isotonic).
- Propose threshold selection for an email use case with a cost per send.
- Explain how you’ll communicate calibration quality and expected impact in slides.
Time-boxed ablations and slides
- Include a minimal ablation plan you would run within the timebox, including which features/modeling choices you would drop first if time runs short.
- Provide the exact headlines of 5–7 slides that tell a compelling story to a hiring manager.

Design a leak-free time-split model

Predict 30-Day Purchase Probability at a Snapshot (Technical Screen)

Requirements

Solution (Locked)

Comments (0)