Predict 30-Day Purchase Probability at a Snapshot (Technical Screen)
Assume you have user, event, and order data with two timestamps per row:
-
event_time: when the user action actually happened (UTC).
-
arrival_time: when the event was recorded/ingested in analytics (UTC). Some events arrive late, i.e., arrival_time > event_time.
Assume an "active user" is any user with at least one event in the 90 days prior to the snapshot (you may state and use a different defensible definition if you prefer). You will deliver slides and runnable code within one week (~4–6 hours) that scores each active user as of a snapshot and predicts their probability of placing an order in the next 30 days.
Requirements
-
Snapshot and label definition
-
Use snapshot_ts = 2025-09-01 00:00:00 UTC.
-
Predict whether an order occurs in the interval [snapshot_ts, snapshot_ts + 30 days).
-
Leakage prevention
-
Specify exactly how you will prevent label leakage, including:
-
Late-arriving events (arrival_time > event_time).
-
Any features computed after snapshot_ts.
-
Modeling approach
-
Propose a minimal but strong baseline and a main model.
-
Justify choosing between logistic regression with monotonic/regularized features versus gradient-boosted trees.
-
List the top 5 engineered features you would start with and explain why.
-
Metrics
-
Choose the primary optimization metric under class imbalance (e.g., PR-AUC vs ROC-AUC) and the business metric you will report on slides.
-
Explain the trade-offs.
-
Evaluation protocol
-
Describe a temporal cross-validation scheme (rolling-origin/blocked) that yields an honest estimate.
-
Explain how you will tune hyperparameters quickly without overfitting given the time budget.
-
Leakage and contamination checks
-
Explain how you will detect and mitigate data/label leakage, target leakage, and train–test contamination.
-
Include at least two concrete checks you would code.
-
Calibration and thresholding for email targeting
-
Outline your calibration plan (e.g., Platt scaling vs isotonic).
-
Propose threshold selection for an email use case with a cost per send.
-
Explain how you’ll communicate calibration quality and expected impact in slides.
-
Time-boxed ablations and slides
-
Include a minimal ablation plan you would run within the timebox, including which features/modeling choices you would drop first if time runs short.
-
Provide the exact headlines of 5–7 slides that tell a compelling story to a hiring manager.