What difficulty level is this interview question?

This is an easy-difficulty Machine Learning question, commonly asked during HR Screen rounds at Adyen.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at Adyen during technical interviews.

Build a Payment Fraud Detection Model | Adyen Interview Question

Q: Build a Payment Fraud Detection Model

This question evaluates a candidate's competency in applied machine learning for payment fraud detection, covering data preprocessing, temporal train/validation/test splitting, missing-value and categorical handling, class-imbalance strategies, baseline modeling, and selection of appropriate evaluation metrics.

You are interviewing for a Machine Learning Engineer role at a FinTech company that processes online card payments. The loop has two technical parts: a short ML-fundamentals discussion, followed by a hands-on coding exercise where you build a runnable fraud-detection pipeline on a payment-transaction dataset.

Constraints & Assumptions

Dataset: one row per transaction with a binary label is_fraud. Features include amount, a transaction timestamp, merchant_category, country, payment_method, device attributes, and customer-history aggregates. Some fields contain missing values.
Class balance: fraud is rare, so the data is severely imbalanced. (Confirm the exact base rate with your interviewer; in payment fraud it is typically a small fraction of a percent, which you should reason about explicitly when choosing metrics and handling imbalance.)
Tooling: you may use standard ML libraries (scikit-learn, pandas, XGBoost/LightGBM) and AI coding assistants, but the final code must run end-to-end on the provided dataset.
Time-ordered data: transactions span a time range; a model deployed in production is trained on the past and scores the future.

Clarifying Questions to Ask

Scope the whole exercise before writing code:

What is the actual fraud base rate, and is the label trustworthy (confirmed chargebacks vs. analyst-flagged)?
Is there a label-maturity delay, i.e., how long after a transaction before its fraud status is final (chargebacks can arrive weeks later)?
What is the downstream action: hard-block, step-up auth, or queue for manual review? And what is the review-team capacity?
What are the relative costs of a false negative (fraud loss) vs. a false positive (blocked good customer)?
Are any features post-hoc (computed using information unavailable at scoring time), and which aggregates were built over the whole dataset?
What latency budget does scoring have at authorization time?
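The cost question above maps directly onto a decision threshold. The following is a minimal sketch, assuming the model outputs calibrated fraud probabilities and that each error type has a constant cost (in practice fraud loss usually scales with the transaction amount). The function names and the cost figures are hypothetical, chosen only for illustration.

```python
def cost_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability above which flagging has lower expected cost than approving.

    Flag when p * cost_fn > (1 - p) * cost_fp,
    which rearranges to p > cost_fp / (cost_fp + cost_fn).
    """
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


def expected_cost(p_fraud: float, flag: bool, cost_fp: float, cost_fn: float) -> float:
    """Expected cost of one decision given a calibrated fraud probability."""
    # Flagging only costs something when the transaction was legitimate;
    # approving only costs something when it was fraud.
    return (1.0 - p_fraud) * cost_fp if flag else p_fraud * cost_fn


# Hypothetical costs: blocking a good customer costs 5, missing a fraud costs 95.
threshold = cost_threshold(cost_fp=5.0, cost_fn=95.0)
print(f"flag when p_fraud > {threshold:.2f}")
```

With asymmetric costs like these, the break-even threshold sits far below 0.5, which is one concrete reason to ask about costs before choosing an operating point.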

Part 1 — ML fundamentals

Answer the following, with enough precision to show you understand the mechanics, not just the definitions:

What is overfitting?
How can you detect overfitting during model development?
How do L1 and L2 regularization reduce overfitting, and how do they differ from each other?

What This Part Should Cover

Mechanism, not definition: ties overfitting to the bias/variance trade-off and to model capacity relative to the available signal, rather than reciting "fits the training data too well."
Concrete detection: names the train-vs-validation gap as the primary tell and at least one curve-based diagnostic (learning curves over epochs or over dataset size, or cross-validation variance).
L1 vs L2 contrast: gives the sparsity-vs-smooth-shrinkage distinction and a reason for it (geometry of the constraint region, or the constant-vs-proportional penalty gradient), not just the two formulas.
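The constant-vs-proportional distinction can be made concrete in the simplest setting. This sketch assumes a single coefficient with unregularized estimate b (equivalently, orthonormal features), where both penalized problems have closed-form solutions. The coefficient values and the penalty strength are hypothetical.

```python
def soft_threshold(b: float, lam: float) -> float:
    """Minimizer of 0.5 * (w - b)**2 + lam * abs(w): the L1 (lasso) solution."""
    # The L1 penalty pulls with constant force lam, so any |b| <= lam lands on zero.
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0


def ridge_shrink(b: float, lam: float) -> float:
    """Minimizer of 0.5 * (w - b)**2 + 0.5 * lam * w**2: the L2 (ridge) solution."""
    # The L2 penalty pulls in proportion to w, so weights shrink but never reach zero.
    return b / (1.0 + lam)


# Hypothetical unregularized coefficients: two strong signals, three weak ones.
unregularized = [3.0, -2.0, 0.4, -0.3, 0.05]
lam = 0.5

l1_weights = [soft_threshold(b, lam) for b in unregularized]
l2_weights = [ridge_shrink(b, lam) for b in unregularized]

print("L1:", l1_weights)
print("L2:", [round(w, 3) for w in l2_weights])
```

In this toy case L1 sets the three weak coefficients to exactly zero and subtracts a fixed amount from the strong ones, while L2 scales every coefficient by the same factor and leaves all of them non-zero.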

Part 2 — Build a fraud-detection pipeline

Write a runnable ML pipeline that trains and evaluates a fraud-detection model on the dataset above. Your solution must:

Split the data into train / validation / test sets without leaking future information.
Handle missing values and categorical features.
Address the severe class imbalance.
Train at least one reasonable baseline model.
Evaluate using metrics appropriate for fraud detection.
Explain what you would improve if you had more time.
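The first requirement, a split that does not leak future information, can be sketched without any ML library. This is a minimal illustration, assuming each row carries a sortable timestamp; the helper name, the field names, and the 70/15/15 proportions are assumptions, not part of the original question.

```python
from datetime import datetime, timedelta


def temporal_split(rows, time_key="timestamp", train_frac=0.7, val_frac=0.15):
    """Sort by time and cut into contiguous train / validation / test blocks."""
    if train_frac <= 0 or val_frac <= 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test set")
    ordered = sorted(rows, key=lambda r: r[time_key])
    n = len(ordered)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    # Earliest block trains, the middle block tunes, the latest block is held out.
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Hypothetical toy data: one transaction per hour, deliberately out of order.
start = datetime(2024, 1, 1)
rows = [{"timestamp": start + timedelta(hours=h), "amount": 10.0 + h} for h in range(100)]
rows.reverse()

train, val, test = temporal_split(rows)
print(len(train), len(val), len(test))
print(train[-1]["timestamp"], "<", val[0]["timestamp"])
```

If labels mature with a delay, one option is to drop a buffer of rows between blocks so that no training label depends on information that arrives during the validation window.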

What This Part Should Cover

Preprocessing discipline: principled, reproducible handling of missing values, categoricals, and the heavy-tailed amount, all fit on the training fold only.
Deliberate imbalance mechanism: picks a concrete way to weight the rare class and moves the decision threshold off the default 0.5, with a stated reason for each.
Justified baseline: a debuggable model appropriate for tabular data (regularized linear or gradient-boosted trees), not a reach for unnecessary complexity.
End-to-end runnability: the code actually executes on the dataset and demonstrates each required behavior, rather than a sketch that never trains.
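One way the rubric above could translate into code is a single scikit-learn Pipeline, so that every imputer, encoder, and scaler is fit on the training block only. This is a sketch under stated assumptions, not a reference solution: the data is synthetic, only three of the listed features are modelled, rows are assumed to be already in time order, and class weighting is one of several possible imbalance mechanisms.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Hypothetical synthetic data standing in for the provided dataset.
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "amount": rng.lognormal(mean=3.0, sigma=1.0, size=n),
    "merchant_category": rng.choice(["grocery", "travel", "gaming"], size=n),
    "country": rng.choice(["NL", "US", "BR"], size=n),
})
# Make fraud rare and more likely for large amounts.
logit = -5.0 + 0.6 * np.log1p(df["amount"])
df["is_fraud"] = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)
# Inject missing values into one numeric and one categorical column.
df.loc[rng.random(n) < 0.05, "amount"] = np.nan
df.loc[rng.random(n) < 0.05, "country"] = np.nan

numeric = Pipeline([
    ("log", FunctionTransformer(np.log1p)),  # compress the heavy tail
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),  # tolerate unseen categories
])
preprocess = ColumnTransformer([
    ("num", numeric, ["amount"]),
    ("cat", categorical, ["merchant_category", "country"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    # class_weight="balanced" reweights the loss inversely to class frequency.
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

features = ["amount", "merchant_category", "country"]
cut = int(len(df) * 0.7)  # earliest 70% of rows train, the rest is held out
train, holdout = df.iloc[:cut], df.iloc[cut:]
model.fit(train[features], train["is_fraud"])
scores = model.predict_proba(holdout[features])[:, 1]
print(scores.shape, float(train["is_fraud"].mean()))
```

Class weighting inflates the predicted probabilities relative to the true base rate, so the scores are best treated as a ranking until a threshold is chosen on validation data or the model is calibrated.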

What a Strong Answer Covers

Beyond the per-Part rubrics, these dimensions span both parts and most distinguish strong candidates:

Problem framing: recognizes that the rare positive class makes a naive overall-accuracy metric meaningless, and ties the objective to a concrete business goal.
Leakage discipline: the split respects time order and the candidate spots the subtler leaks (features knowable only after the outcome, aggregates fit over the whole timeline, any transform or resampling fit outside the training fold). This is the single most discriminating signal across both parts.
Evaluation maturity: metrics are appropriate under severe imbalance (PR-AUC, recall-at-fixed-precision) and connected to a real operating constraint, reported on a held-out set with the threshold selected on validation.
Roadmap: names the highest-leverage next steps (richer behavioral/velocity features, calibration, drift handling, explainability, a labels feedback loop) and explains why each matters here.
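The evaluation dimension above (PR-AUC, recall at a fixed precision, threshold chosen on validation and reported on held-out data) can be sketched in plain Python. The labels and scores below are hypothetical toy values, the helper names are illustrative, and the average-precision helper assumes no tied scores.

```python
def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (assumes no tied scores)."""
    positives = sum(y_true)
    if positives == 0:
        raise ValueError("average precision is undefined without positives")
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at the rank of each true fraud
    return total / positives


def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when flagging every score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def threshold_for_precision(y_val, val_scores, min_precision):
    """Lowest validation threshold meeting the precision target (highest recall)."""
    for t in sorted(set(val_scores)):
        precision, _ = precision_recall_at(y_val, val_scores, t)
        if precision >= min_precision:
            return t
    return None  # the target precision is unreachable on validation


# Hypothetical validation and test labels/scores.
y_val = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
val_scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.55, 0.60, 0.70, 0.80, 0.90]
y_test = [0, 0, 1, 0, 0, 1, 0, 1]
test_scores = [0.10, 0.20, 0.40, 0.50, 0.58, 0.65, 0.30, 0.95]

threshold = threshold_for_precision(y_val, val_scores, min_precision=0.75)
val_pr = precision_recall_at(y_val, val_scores, threshold)
test_pr = precision_recall_at(y_test, test_scores, threshold)
print("PR-AUC (val):", round(average_precision(y_val, val_scores), 3))
print("threshold:", threshold, "val P/R:", val_pr, "test P/R:", test_pr)
```

In this toy example the threshold meets the precision target on validation but falls short of it on the test set, which is why the held-out numbers, not the validation ones, are the ones to report.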

Follow-up Questions

Fraud patterns shift over time. How would you detect and respond to drift, and how often would you retrain?
Chargeback labels arrive with a weeks-long delay. How does this affect how you construct training labels, time-based splits, and online evaluation?
Your scores feed a manual-review queue with capacity for only the top N transactions per day. How do you turn a probability into an action, and how do you calibrate scores so the cutoff is meaningful?
How would you measure and mitigate disparate false-positive rates across countries or customer segments?
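Two of these follow-ups lend themselves to a short sketch: turning scores into a capacity-limited review queue, and monitoring a feature or score distribution for drift. The population stability index used here is one common drift statistic, swapped in as an illustration rather than prescribed by the question; the helper names, bin edges, and sample values are all hypothetical.

```python
import math


def top_n_cutoff(scores, capacity):
    """Score of the capacity-th highest transaction; review everything at or above it."""
    if capacity <= 0 or not scores:
        raise ValueError("need a positive capacity and at least one score")
    ranked = sorted(scores, reverse=True)
    # Tied scores at the cutoff can push the queue slightly over capacity.
    return ranked[min(capacity, len(ranked)) - 1]


def population_stability_index(expected, actual, edges, eps=1e-6):
    """PSI between a reference sample and a recent sample over fixed bin edges."""
    def bin_fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v >= e for e in edges)] += 1  # index of the bin holding v
        # Floor empty bins at eps so the logarithm stays finite.
        return [max(c / len(values), eps) for c in counts]

    exp_frac, act_frac = bin_fractions(expected), bin_fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_frac, act_frac))


# Hypothetical daily scores and a review team that can handle 3 cases.
day_scores = [0.02, 0.91, 0.15, 0.40, 0.77, 0.05, 0.63]
cutoff = top_n_cutoff(day_scores, capacity=3)
queue = [s for s in day_scores if s >= cutoff]
print("cutoff:", cutoff, "queue:", queue)

# Hypothetical transaction amounts: training period vs. a recent window.
reference = [10, 20, 30, 40, 50, 60, 70, 80]
recent = [60, 70, 80, 90, 65, 75, 85, 95]
edges = [25, 50, 75]
print("PSI, no shift:", population_stability_index(reference, reference, edges))
print("PSI, shifted:", round(population_stability_index(reference, recent, edges), 3))
```

A rank-based cutoff only needs the scores to order transactions correctly, whereas a fixed probability cutoff needs calibrated scores; the alert level for the drift statistic is a judgment call that depends on the feature and on the retraining cost.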
