Build a Payment Fraud Detection Model
Company: Adyen
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: easy
Interview Round: HR Screen
You are interviewing for a **Machine Learning Engineer** role at a FinTech company that processes online card payments. The loop has two technical parts: a short ML-fundamentals discussion, followed by a hands-on coding exercise where you build a runnable fraud-detection pipeline on a payment-transaction dataset.
### Constraints & Assumptions
- **Dataset**: one row per transaction with a binary label `is_fraud`. Features include `amount`, a transaction timestamp, `merchant_category`, `country`, `payment_method`, device attributes, and customer-history aggregates. Some fields contain missing values.
- **Class balance**: fraud is rare, so the data is severely imbalanced. (Confirm the exact base rate with your interviewer; in payment fraud it is typically a small fraction of a percent, which you should reason about explicitly when choosing metrics and handling imbalance.)
- **Tooling**: you may use standard ML libraries (scikit-learn, pandas, XGBoost/LightGBM) and AI coding assistants, but the final code must **run end-to-end** on the provided dataset.
- **Time-ordered data**: transactions span a time range; a model deployed in production is trained on the past and scores the future.
### Clarifying Questions to Ask
Scope the whole exercise before writing code:
- What is the actual fraud base rate, and is the label trustworthy (confirmed chargebacks vs. analyst-flagged)?
- Is there a **label-maturity delay** — i.e., how long after a transaction before its fraud status is final (chargebacks can arrive weeks later)?
- What is the downstream **action**: hard-block, step-up auth, or queue for manual review? And what is the review-team capacity?
- What are the relative costs of a **false negative** (fraud loss) vs. a **false positive** (blocked good customer)?
- Are any features **post-hoc** (computed using information unavailable at scoring time), and which aggregates were built over the whole dataset?
- What latency budget does scoring have at authorization time?
---
### Part 1 — ML fundamentals
Answer the following, with enough precision to show you understand the mechanics, not just the definitions:
- What is **overfitting**?
- How can you **detect** overfitting during model development?
- How do **L1** and **L2** regularization reduce overfitting, and how do they **differ** from each other?
```hint What "differ" is really asking
Tie each penalty to its effect on the weight vector: one tends to produce *sparse* weights (some exactly zero), the other *shrinks* weights smoothly. Think about the geometry of the $\ell_1$ "diamond" vs the $\ell_2$ "ball" constraint region.
```
```hint Connect it back to the task
Overfitting in fraud data has a concrete face: high-cardinality identifiers (customer/device IDs) let the model *memorize* who committed fraud in the train window. Tie your answer to how regularization — or simply not feeding raw IDs — defends against that.
```
#### What This Part Should Cover
- **Mechanism, not definition**: ties overfitting to the bias/variance trade-off and to model capacity relative to the available signal, rather than reciting "fits the training data too well."
- **Concrete detection**: names the train-vs-validation gap as the primary tell and at least one curve-based diagnostic (learning curves over epochs or over dataset size, or cross-validation variance).
- **L1 vs L2 contrast**: gives the sparsity-vs-smooth-shrinkage distinction *and* a reason for it (geometry of the constraint region, or the constant-vs-proportional penalty gradient), not just the two formulas.
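The sparsity-versus-shrinkage contrast can be made concrete with a one-weight sketch. It assumes a simplified setting (a single coefficient with unpenalized estimate `w_ols`, squared-error loss), in which both penalized problems have closed-form minimizers. The function names and sample weights are illustrative, not part of the exercise.

```python
def l1_shrink(w_ols: float, lam: float) -> float:
    """Minimizer of 0.5*(w - w_ols)**2 + lam*abs(w): soft-thresholding.

    The L1 penalty pulls with constant force lam, so any weight whose
    magnitude is at most lam is set exactly to zero.
    """
    if w_ols > lam:
        return w_ols - lam
    if w_ols < -lam:
        return w_ols + lam
    return 0.0


def l2_shrink(w_ols: float, lam: float) -> float:
    """Minimizer of 0.5*(w - w_ols)**2 + 0.5*lam*w**2: proportional shrinkage.

    The L2 penalty pulls in proportion to w, so weights shrink toward
    zero but never reach it unless they started there.
    """
    return w_ols / (1.0 + lam)


unpenalized = [3.0, -1.5, 0.4, -0.2]  # illustrative weights
lam = 0.5

l1_weights = [l1_shrink(w, lam) for w in unpenalized]
l2_weights = [l2_shrink(w, lam) for w in unpenalized]

print("L1:", l1_weights)  # the two small weights become exactly 0.0
print("L2:", l2_weights)  # every weight is scaled by 1 / (1 + lam)
```

The constant pull of L1 versus the proportional pull of L2 is the "constant-vs-proportional penalty gradient" argument in code form.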
### Part 2 — Build a fraud-detection pipeline
Write a runnable ML pipeline that trains and evaluates a fraud-detection model on the dataset above. Your solution must:
- Split the data into **train / validation / test** sets without leaking future information.
- Handle **missing values** and **categorical features**.
- Address the **severe class imbalance**.
- Train at least one reasonable **baseline model**.
- **Evaluate** using metrics appropriate for fraud detection.
- Explain what you would **improve** if you had more time.
```hint Splitting strategy
A random shuffle split leaks the future into the past. What split respects the time order so train < validation < test chronologically? Also ask which features are computed *after* the transaction outcome is known.
```
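One way to express a time-respecting split is with explicit cutoff timestamps, sketched below with pandas. The column name `timestamp`, the helper `chronological_split`, and the cutoff dates are assumptions for illustration. The real dataset's timestamp column and sensible cutoffs (including any gap left for label maturity) should be confirmed with the interviewer.

```python
import pandas as pd


def chronological_split(df, ts_col, val_start, test_start):
    """Split so that train < validation < test strictly in time.

    Rows before val_start go to train, rows in [val_start, test_start)
    go to validation, and rows at or after test_start go to test.
    """
    val_start = pd.Timestamp(val_start)
    test_start = pd.Timestamp(test_start)
    if not val_start < test_start:
        raise ValueError("val_start must be earlier than test_start")
    ts = df[ts_col]
    train = df[ts < val_start]
    val = df[(ts >= val_start) & (ts < test_start)]
    test = df[ts >= test_start]
    return train, val, test


# Toy data: 30 daily transactions, deliberately stored newest-first
# to show that the split does not depend on row order.
toy = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=30, freq="D")[::-1],
    "amount": range(30),
})

train, val, test = chronological_split(
    toy, "timestamp", val_start="2024-01-21", test_start="2024-01-26"
)
print(len(train), len(val), len(test))
```

Cutting on timestamps rather than on row counts keeps all transactions that share a timestamp on the same side of the boundary.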
```hint Imbalance — where it bites
Accuracy is useless when the positive rate is a fraction of a percent: a model that predicts "not fraud" for every transaction already scores above 99%. Decide both (a) how you teach the model to care about the rare class (`class_weight`, resampling the *training fold only*, or a cost-sensitive objective) and (b) which metric you optimize.
```
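To make the weighting option concrete, the sketch below computes the two quantities most often plugged into libraries: per-class weights following the formula scikit-learn documents for `class_weight="balanced"` (`n_samples / (n_classes * count)`), and the negative-to-positive ratio commonly passed as `scale_pos_weight` in XGBoost or LightGBM. The helper names and the 1% toy fraud rate are assumptions. The real base rate is usually lower.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Per-class weight n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def neg_pos_ratio(labels):
    """Ratio of negatives to positives, a common scale_pos_weight value."""
    pos = sum(1 for y in labels if y == 1)
    if pos == 0:
        raise ValueError("training labels contain no positive examples")
    return (len(labels) - pos) / pos


# Toy training labels: 990 legitimate, 10 fraudulent.
y_train = [0] * 990 + [1] * 10

weights = balanced_class_weights(y_train)
ratio = neg_pos_ratio(y_train)
print(weights)  # the fraud class gets a weight of 50.0
print(ratio)    # 99.0
```

Both values must be computed from the training fold only, and reweighting distorts predicted probabilities, which is one reason calibration comes up later.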
```hint Choosing the metric
Under severe imbalance, ROC-AUC can look flatteringly high: the false-positive rate is divided by the huge number of legitimate transactions, so even thousands of false alarms barely move it. Which curve isolates performance on the positive class, and which operating-point metric maps to a real review-capacity constraint?
```
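A dependency-free sketch of two candidate metrics follows: a step-wise average precision (a summary of the precision-recall curve) and precision/recall among the top `k` scored transactions, where `k` stands in for a review-team capacity. The implementation ignores tied scores for simplicity, so it is an illustration rather than a replacement for a library routine. The function names and toy scores are assumptions.

```python
def average_precision(y_true, scores):
    """Mean of precision values at the rank of each positive example.

    Simplified: tied scores are ordered arbitrarily rather than grouped.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


def precision_recall_at_k(y_true, scores, k):
    """Precision and recall if only the k highest-scored rows are reviewed."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    true_positives = sum(label for _, label in ranked[:k])
    return true_positives / k, true_positives / sum(y_true)


y_true = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]

ap = average_precision(y_true, scores)
prec_at_2, rec_at_2 = precision_recall_at_k(y_true, scores, k=2)
print(round(ap, 4), prec_at_2, rec_at_2)
```

In a real pipeline both numbers would be reported on the held-out test window, with `k` set from the actual daily review capacity.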
```hint Keep it runnable
The interviewer cares that the code *runs*. A `ColumnTransformer` inside a `Pipeline` refits every preprocessing step on whatever data you pass to `fit`, so fitting it on the train fold alone keeps validation and test statistics out of the transforms (leakage-safe). It is also far less error-prone under time pressure than a hand-rolled neural network that never trains.
```
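A minimal sketch of such a pipeline with scikit-learn is shown below. The column lists reuse feature names from the dataset description, but the synthetic data, the label pattern, the 300/100 row split standing in for a time split, and the choice of a class-weighted logistic regression are all assumptions made so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

NUMERIC = ["amount"]
CATEGORICAL = ["merchant_category", "country", "payment_method"]

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("log1p", FunctionTransformer(np.log1p)),  # tame the heavy tail; assumes amount >= 0
        ("scale", StandardScaler()),
    ]), NUMERIC),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),  # unseen categories encode as all zeros
    ]), CATEGORICAL),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])


def with_gaps(values, every):
    """Return an object array with every `every`-th entry set to NaN."""
    out = np.array(values, dtype=object)
    out[::every] = np.nan
    return out


# Synthetic stand-in for the real dataset; labels are arbitrary (10% positives),
# so this demonstrates runnability, not predictive quality.
rng = np.random.default_rng(0)
n = 400
amount = rng.lognormal(mean=3.0, sigma=1.0, size=n)
amount[::17] = np.nan
frame = pd.DataFrame({
    "amount": amount,
    "merchant_category": with_gaps(rng.choice(["grocery", "travel", "gaming"], n), 11),
    "country": with_gaps(rng.choice(["NL", "US", "BR"], n), 13),
    "payment_method": rng.choice(["card", "wallet"], n),
    "is_fraud": (np.arange(n) % 10 == 0).astype(int),
})

features = NUMERIC + CATEGORICAL
train, holdout = frame.iloc[:300], frame.iloc[300:]  # row order stands in for time order

model.fit(train[features], train["is_fraud"])  # all transforms are fit on train only
fraud_scores = model.predict_proba(holdout[features])[:, 1]
print(fraud_scores[:5])
```

Swapping the final step for a gradient-boosted model leaves the preprocessing and the fit/score calls unchanged, which is the main practical benefit of this structure.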
#### What This Part Should Cover
- **Preprocessing discipline**: principled, reproducible handling of missing values, categoricals, and the heavy-tailed `amount`, all fit on the training fold only.
- **Deliberate imbalance mechanism**: picks a concrete way to weight the rare class *and* moves the decision threshold off the default 0.5, with a stated reason for each.
- **Justified baseline**: a debuggable model appropriate for tabular data (regularized linear or gradient-boosted trees), not a reach for unnecessary complexity.
- **End-to-end runnability**: the code actually executes on the dataset and demonstrates each required behavior, rather than a sketch that never trains.
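Moving the threshold off 0.5 can be done by minimizing an explicit cost on the validation set, as in the sketch below. The per-error costs (`cost_fn`, `cost_fp`), the helper name, and the toy scores are assumptions. Real costs come from the false-negative versus false-positive discussion with the interviewer.

```python
def pick_threshold(y_val, scores_val, cost_fn, cost_fp):
    """Return (threshold, total_cost) minimizing cost on validation data.

    A transaction is flagged when score >= threshold. Candidate thresholds
    are the observed scores plus infinity (flag nothing). The scan is
    quadratic in the number of rows, which is fine for a sketch.
    """
    candidates = sorted(set(scores_val)) + [float("inf")]
    best = None
    for threshold in candidates:
        false_neg = sum(1 for y, s in zip(y_val, scores_val) if y == 1 and s < threshold)
        false_pos = sum(1 for y, s in zip(y_val, scores_val) if y == 0 and s >= threshold)
        cost = cost_fn * false_neg + cost_fp * false_pos
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best


# Toy validation scores: missed fraud is assumed to cost 10x a false alarm.
y_val = [0, 0, 0, 1, 0, 1]
scores_val = [0.02, 0.05, 0.10, 0.20, 0.30, 0.60]

threshold, cost = pick_threshold(y_val, scores_val, cost_fn=10, cost_fp=1)
print(threshold, cost)  # the chosen cutoff sits well below 0.5
```

The threshold chosen here is then frozen and applied unchanged to the test window, so the reported test metrics are not tuned on test data.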
---
### What a Strong Answer Covers
Beyond the per-Part rubrics, these dimensions span both parts and most distinguish strong candidates:
- **Problem framing**: recognizes that the rare positive class makes a naive overall-accuracy metric meaningless, and ties the objective to a concrete business goal.
- **Leakage discipline**: the split respects time order *and* the candidate spots the subtler leaks — features knowable only after the outcome, aggregates fit over the whole timeline, any transform or resampling fit outside the training fold. This is the single most discriminating signal across both parts.
- **Evaluation maturity**: metrics are appropriate under severe imbalance (PR-AUC, recall-at-fixed-precision) and connected to a real operating constraint, reported on a held-out set with the threshold selected on validation.
- **Roadmap**: names the highest-leverage next steps (richer behavioral/velocity features, calibration, drift handling, explainability, a labels feedback loop) and explains *why* each matters here.
### Follow-up Questions
- Fraud patterns shift over time. How would you **detect and respond to drift**, and how often would you retrain?
- Chargeback labels arrive with a **weeks-long delay**. How does this affect how you construct training labels, time-based splits, and online evaluation?
- Your scores feed a manual-review queue with capacity for only the top *N* transactions per day. How do you turn a probability into an action, and how do you **calibrate** scores so the cutoff is meaningful?
- How would you measure and mitigate disparate **false-positive rates** across countries or customer segments?
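For the last question, a first measurement step is simply the false-positive rate computed per segment, sketched below in plain Python. The segment values, the helper name, and the toy predictions are assumptions. In practice the segments would be countries or customer cohorts from the dataset, and small segments would need confidence intervals before drawing conclusions.

```python
from collections import defaultdict


def fpr_by_segment(y_true, y_pred, segments):
    """False-positive rate (flagged legitimate / all legitimate) per segment.

    Segments with no legitimate transactions are omitted, since their
    false-positive rate is undefined.
    """
    false_pos = defaultdict(int)
    negatives = defaultdict(int)
    for label, pred, seg in zip(y_true, y_pred, segments):
        if label == 0:
            negatives[seg] += 1
            false_pos[seg] += int(pred == 1)
    return {seg: false_pos[seg] / negatives[seg] for seg in negatives}


y_true = [0, 0, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
country = ["NL", "NL", "NL", "NL", "NL", "BR", "BR", "BR"]

rates = fpr_by_segment(y_true, y_pred, country)
print(rates)  # NL: 1 of 4 legitimate flagged; BR: 2 of 2 legitimate flagged
```

A large gap between segments at the same global threshold is the signal to investigate, for example with per-segment calibration checks or a review of which features drive the extra false alarms.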
Quick Answer: This question evaluates a candidate's competency in applied machine learning for payment fraud detection, covering data preprocessing, temporal train/validation/test splitting, missing-value and categorical handling, class-imbalance strategies, baseline modeling, and selection of appropriate evaluation metrics.