This question evaluates a Machine Learning Engineer's competency in end-to-end ML system design for real-time payments fraud detection, including labeling under delayed confirmations, handling extreme class imbalance and sampling, feature engineering across behavioral, graph, device and merchant signals, model selection for latency and scale, and production scoring and monitoring architecture. It is commonly asked in the ML System Design category to assess how an engineer balances low-latency decision-making with delayed sparse labels, calibration and threshold trade-offs, operational scalability and resiliency, and drift/adversarial detection, testing both conceptual understanding and practical application.
Design an end-to-end fraud detection system. Specify positive/negative labeling strategy given delayed and scarce fraud confirmations, sampling to address extreme class imbalance, feature sets (behavioral, graph, device, merchant), model choices and justification, real-time scoring architecture and latency constraints, thresholding and precision/recall trade-offs, evaluation metrics (PR-AUC, precision@k, cost-sensitive metrics), and monitoring for drift and adversarial adaptation.
Hard | Machine Learning Engineer | Onsite | ML System Design
Design an End-to-End Real-Time Payments Fraud Detection System
You are a Machine Learning Engineer at a large online payments platform. Design a traditional ML fraud detection system that issues an approve / review / decline decision synchronously at authorization time, then learns from confirmed-fraud labels (chargebacks, disputes, investigations) that arrive weeks later and are scarce.
This is a "classic" ML system design problem — deep neural sequence models and graph learning may appear as components, but the spine is a calibrated, low-latency tabular pipeline. Work through the design end to end, justifying each choice against the latency, label-delay, and class-imbalance constraints.
Constraints & Assumptions
Decision point: in-line with the payment authorization flow; the score gates whether the charge is attempted, so the model's inference must fit a single-digit-to-low-tens-of-ms budget inside an end-to-end P99 on the order of tens of milliseconds.
Label delay: the true label of a transaction may not be known for 30–90 days (the chargeback/dispute cycle), and some fraud is never confirmed.
Class imbalance: transaction-level fraud prevalence is typically well under 1%.
Non-stationarity: the environment is adversarial; fraudsters adapt continuously, so signal decays.
Train/serve consistency: features computed offline for training must equal features computed online at decision time.
Treat exact SLAs, prevalence, and cost figures as values you would negotiate with the business; state your assumptions rather than inventing precise numbers.
Clarifying Questions to Ask
What is the cost asymmetry: what does an approved fraudulent transaction cost (loss net of recovery) versus a wrongly declined good transaction (lost margin + customer-experience damage), and what does a manual review cost?
Is there a human review queue, and what is its capacity and SLA? (This determines whether "review" is even an available action.)
What is the actual latency SLO at authorization, and how is the budget split across the auth flow?
What label sources exist (network chargeback reason codes, first-party fraud reports, manual investigation verdicts) and how reliable/timely is each?
What is the regulatory / adverse-action context (e.g. need for reason codes, restrictions on protected attributes)?
Do we have an exploration budget to approve a small fraction of would-be-declines for unbiased labels?
Part 1 — Labeling under delayed, scarce confirmations
Define how you turn raw transactions into training labels when confirmed fraud arrives weeks late. Cover how you assign positive vs. negative labels, the role of observation/maturity windows, how you handle disputed or undetermined outcomes, and how you avoid target leakage.
What This Part Should Cover
A precise positive definition (which confirmed signals, within what window from t0) and a "fully matured, no signal" negative definition.
Explicit treatment of immature/disputed transactions (not silently labeled negative).
The label-freshness vs. label-maturity tension, and a concrete way to react faster (weak/early-proxy positives) without corrupting the clean label set.
Concrete anti-leakage discipline: time-based splits and point-in-time feature construction.
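The labeling rules above can be sketched as a small function. This is only an illustration under stated assumptions: the 90-day maturity window, the `Label` enum, and the field names `fraud_confirmed_at` and `dispute_open` are hypothetical, and the real window and label sources are values to agree with the business.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional

# Assumption: use the upper end of the 30-90 day chargeback/dispute cycle.
MATURITY_WINDOW = timedelta(days=90)


class Label(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    EXCLUDED = "excluded"  # immature or disputed: held out, never silently negative


def assign_label(t0: datetime,
                 as_of: datetime,
                 fraud_confirmed_at: Optional[datetime] = None,
                 dispute_open: bool = False) -> Label:
    """Label a transaction authorized at t0 using only facts known by as_of."""
    window_end = t0 + MATURITY_WINDOW
    # Positive: a confirmation landed inside the window AND is already
    # visible at the snapshot time (a later confirmation would be leakage).
    if fraud_confirmed_at is not None and fraud_confirmed_at <= min(as_of, window_end):
        return Label.POSITIVE
    # Unresolved dispute, or the window has not closed yet: keep it out of
    # the clean training set instead of calling it a negative.
    if dispute_open or as_of < window_end:
        return Label.EXCLUDED
    # Fully matured with no fraud signal.
    return Label.NEGATIVE
```

Passing `as_of` explicitly is the design choice that matters here: the same function can rebuild the label set as it would have looked on any past date, which keeps time-based splits honest.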
Part 2 — Sampling under extreme class imbalance
Describe your offline training strategy for a sub-1% positive rate and what (if anything) you do differently at serving time. Be explicit about how you keep probabilities meaningful.
What This Part Should Cover
A tractable scheme (downsample matured negatives, keep all positives and hard negatives, preserve the time distribution).
An explicit mechanism to recover true-prior, calibrated probabilities (logit prior-correction and/or post-hoc calibration), and the distinction from loss reweighting.
Selection-bias awareness (only-approved-transactions-have-labels) and a mitigation.
That sampling is training-only — serving scores every event on the true distribution.
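To make the prior-correction point concrete: if negatives are kept with probability `keep_rate` and all positives are kept, the odds in the sampled data are inflated by `1 / keep_rate`, so multiplying the model's odds by `keep_rate` recovers the true-prior probability. The sketch below assumes uniform negative downsampling; the function name is illustrative, and a post-hoc calibrator fitted on unsampled matured data may still be needed on top.

```python
def correct_for_negative_downsampling(p_sampled: float, keep_rate: float) -> float:
    """Map a probability learned on downsampled data back to the true prior.

    Equivalent to shifting the logit by log(keep_rate):
        odds_true = keep_rate * odds_sampled
    """
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    if not 0.0 <= p_sampled <= 1.0:
        raise ValueError("p_sampled must be a probability")
    return keep_rate * p_sampled / (keep_rate * p_sampled + (1.0 - p_sampled))
```

For example, a score of 0.5 from a model trained with 1% of negatives kept corresponds to roughly a 1% true-prior probability, which is why uncorrected scores cannot be fed directly into cost-based thresholds.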
Part 3 — Feature sets
Enumerate the feature families you would build and how they are served. Cover behavioral/velocity, graph/link, device/network, and merchant/context signals.
What This Part Should Cover
Graph/link: identity graph across accounts/cards/devices/IPs and derived risk (neighbor fraud, component size, diffusion), with a note that traversal is too slow in-line.
Merchant/context: MCC/merchant chargeback rate, amount-vs-history, 3DS/AVS/CVV outcomes; plus serving/hygiene (online vs. offline store consistency, PII handling, freshness SLAs).
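As an illustration of point-in-time construction for a velocity feature, the sketch below counts an entity's prior events in a trailing window that ends strictly before the transaction being scored. The function name and the sorted-timestamp input are assumptions; a production system would usually maintain this aggregate in a streaming job and read it from the online store.

```python
from bisect import bisect_left
from datetime import datetime, timedelta
from typing import Sequence


def velocity_count(event_times_sorted: Sequence[datetime],
                   t0: datetime,
                   window: timedelta) -> int:
    """Count events in [t0 - window, t0) for one card, device, or account.

    The transaction at t0 itself (and anything later) is excluded, so the
    offline value matches what was knowable at decision time.
    """
    lo = bisect_left(event_times_sorted, t0 - window)
    hi = bisect_left(event_times_sorted, t0)  # strictly before t0
    return hi - lo
```

Using the same half-open window definition offline and online is one simple way to avoid train/serve skew on this family of features.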
Part 4 — Model choices and justification
Recommend a baseline and the advanced components, justified against latency and scale. Address how graphs, sequences, and semi-/weak supervision fit in without breaking the latency budget.
What This Part Should Cover
A justified primary model (GBDT) and why it fits latency/scale/explainability.
How advanced signals enter (graph/sequence embeddings as nearline features; entity embeddings for high cardinality) without in-line traversal.
A serving strategy that bounds latency (cascade / two-stage) and a rule layer for hard blocks and fallback.
Use of weak/semi-supervision and active learning to grow labels.
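One way to picture the cascade is shown below. It is a sketch, not a prescribed design: `hard_block`, `stage1`, and `stage2` are hypothetical callables, and the `low` and `high` cutoffs are placeholder values. It also assumes both stages emit probabilities calibrated to the same scale, which would need to be verified.

```python
from typing import Callable, Mapping, Tuple

Features = Mapping[str, float]


def cascade_score(features: Features,
                  hard_block: Callable[[Features], bool],
                  stage1: Callable[[Features], float],
                  stage2: Callable[[Features], float],
                  low: float = 0.02,
                  high: float = 0.90) -> Tuple[float, str]:
    """Return (fraud probability, which layer produced it)."""
    # Rule layer first: deterministic hard blocks need no model call.
    if hard_block(features):
        return 1.0, "rule"
    # Cheap first-stage model scores every transaction.
    p1 = stage1(features)
    # Confident either way: stop here and save the latency budget.
    if p1 < low or p1 > high:
        return p1, "stage1"
    # Ambiguous band only: pay for the heavier second-stage model.
    return stage2(features), "stage2"
```

Because only the ambiguous band reaches the second stage, the expensive model's latency is bounded by the share of traffic that falls between the two cutoffs.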
Part 5 — Real-time scoring architecture and latency
Lay out the end-to-end serving architecture and the latency budget. Cover event ingestion, the online/offline feature store split, streaming aggregations, model serving, and fallbacks; give an illustrative P99 budget and how you keep the system resilient.
What This Part Should Cover
The data path: event bus with idempotency, streaming aggregations, online + offline feature stores kept skew-free, stateless model serving.
An illustrative P99 allocation, framed as a planning target to be negotiated against the SLO (and not a strict additive sum).
Fallbacks and resiliency: degraded-feature handling, model-down fallback to rules, fail-open/closed by risk segment, shadow/blue-green rollout.
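The fallback behavior can be sketched as follows. The function and parameter names are hypothetical, the `decline_at` cutoff is a placeholder, and the choice to send high-risk segments to review while approving low-risk ones when the model is unavailable is one possible fail-closed/fail-open policy, not the only reasonable one.

```python
from typing import Callable, Mapping, Optional, Tuple

Features = Mapping[str, Optional[float]]


def decide_with_fallback(features: Features,
                         model: Callable[[Features], float],
                         rule_decline: Callable[[Features], bool],
                         high_risk_segment: bool,
                         decline_at: float = 0.9) -> Tuple[str, str]:
    """Return (decision, source) and never raise into the auth flow."""
    # The rule layer works even when the model service is down.
    if rule_decline(features):
        return "decline", "rule"
    try:
        p = model(features)
    except Exception:
        # Model down or timed out: fail closed (review) for high-risk
        # segments, fail open (approve) for everything else.
        return ("review" if high_risk_segment else "approve"), "fallback"
    return ("decline" if p >= decline_at else "approve"), "model"
```

Returning the source alongside the decision makes it possible to monitor how often traffic is served by the fallback path.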
Part 6 — Thresholding and precision/recall trade-offs
Turn calibrated scores into the approve / review / decline policy. Show how thresholds come from the cost structure (not guesses), and how a review queue and segment-specific costs change the cutoffs.
What This Part Should Cover
A cost-matrix-derived two-action threshold and why amount-aware thresholds beat a single global cutoff.
A three-action policy with a review band, set by expected-profit/queue-capacity reasoning, and queue prioritization by expected loss (probability × amount).
Per-segment thresholds and surfacing the PR curve so operators pick an operating point.
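The cost-matrix reasoning can be made concrete with a short sketch. It assumes calibrated probabilities, and every cost parameter (`margin_rate`, `friction_cost`, `review_cost`, `review_catch_rate`) is an illustrative placeholder, not a recommended value. It also simplifies by assuming reviewers never wrongly decline a good customer.

```python
def decline_threshold(fraud_loss: float, false_decline_cost: float) -> float:
    """Probability above which declining is cheaper than approving.

    Approving costs p * fraud_loss in expectation; declining costs
    (1 - p) * false_decline_cost. Equating the two gives the cutoff.
    """
    if fraud_loss <= 0 or false_decline_cost < 0:
        raise ValueError("fraud_loss must be > 0 and false_decline_cost >= 0")
    return false_decline_cost / (false_decline_cost + fraud_loss)


def choose_action(p: float,
                  amount: float,
                  margin_rate: float = 0.03,
                  friction_cost: float = 5.0,
                  review_cost: float = 2.0,
                  review_catch_rate: float = 0.9) -> str:
    """Pick approve / review / decline by minimum expected cost."""
    expected_cost = {
        # Fraud that gets through loses the full amount.
        "approve": p * amount,
        # A declined good customer loses margin plus a fixed friction cost.
        "decline": (1 - p) * (margin_rate * amount + friction_cost),
        # Review always costs analyst time and misses some fraud.
        "review": review_cost + p * (1 - review_catch_rate) * amount,
    }
    return min(expected_cost, key=expected_cost.get)
```

With these placeholder costs, the decline cutoff falls as the amount grows, which is the sense in which amount-aware thresholds hold high-value transactions to a stricter bar than a single global cutoff would.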
Part 7 — Evaluation metrics
Specify the offline and online metrics, why they suit extreme imbalance, and — critically — how you evaluate honestly given delayed labels and the fact that historical declines have no observed outcome.
What This Part Should Cover
Imbalance-aware ranking and operating-point metrics (PR-AUC, precision/recall@k) plus amount-weighted cost/profit on the business cost matrix.
Evaluating only on matured label windows and never scoring a too-recent window.
Off-policy / counterfactual evaluation (IPW / doubly-robust, anchored by an exploration holdout) and decision-aware backtesting under queue constraints.
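A minimal sketch of three of these quantities follows, using only the standard library. Average precision is used as a common step-wise estimate of PR-AUC. The inverse-propensity function assumes each approved, matured transaction carries a logged approval probability (for example from an exploration holdout); that logging scheme and the tuple layout are assumptions.

```python
from typing import Iterable, Sequence, Tuple


def precision_at_k(scores: Sequence[float], labels: Sequence[int], k: int) -> float:
    """Fraction of true fraud among the k highest-scored transactions."""
    if k <= 0 or k > len(scores):
        raise ValueError("k must be between 1 and len(scores)")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Mean of precision@rank taken at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("average precision is undefined with no positives")
    return total / hits


def ipw_fraud_loss(records: Iterable[Tuple[float, bool, float]]) -> float:
    """Estimate fraud loss under an approve-everything policy.

    Each record is (amount, is_fraud, approval_propensity) for a transaction
    that WAS approved and has matured. Rarely approved traffic is upweighted
    by 1 / propensity to stand in for the declines with no observed outcome.
    """
    total = 0.0
    for amount, is_fraud, propensity in records:
        if not 0.0 < propensity <= 1.0:
            raise ValueError("propensity must be in (0, 1]")
        if is_fraud:
            total += amount / propensity
    return total
```

The inverse-propensity estimate has high variance when propensities are small, which is one reason to pair it with a doubly-robust estimator or a dedicated exploration holdout.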
Part 8 — Monitoring for drift and adversarial adaptation
Describe what you monitor before labels mature, how you detect concept drift and adversarial behavior, and your retraining/rollout guardrails.
What This Part Should Cover
Pre-label health signals (feature availability/freshness, input and score drift, early label proxies).
Concept-drift response: retraining cadence on rolling windows plus fast hotfix retrains, nearline refresh of velocity/graph features.
Adversarial monitoring (surge/graph-anomaly detectors) and reversible responses (friction/step-up auth) rather than only hard declines.
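Score drift can be watched before any labels mature with a population stability index (PSI) between a reference window and the current window. The sketch below assumes scores in [0, 1] and equal-width bins; the bin count and the `eps` floor are arbitrary choices, and any alert threshold would have to be tuned on historical data.

```python
import math
from typing import List, Sequence


def population_stability_index(expected: Sequence[float],
                               actual: Sequence[float],
                               n_bins: int = 10,
                               eps: float = 1e-6) -> float:
    """PSI = sum over bins of (a_i - e_i) * ln(a_i / e_i)."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    def bin_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            # Clamp so that 1.0 and any out-of-range value land in an edge bin.
            idx = min(max(int(v * n_bins), 0), n_bins - 1)
            counts[idx] += 1
        # Floor empty bins at eps to keep the logarithm finite.
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

The same function can be applied per feature as well as to the model score, which helps separate input drift from a change in the model's output distribution.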
Across all parts, a strong answer treats fraud detection as a cost-minimization decision system, not an accuracy benchmark, and keeps four cross-cutting threads coherent end to end:
The decision objective is dollars, not AUC: every threshold, metric, and sampling choice traces back to a cost matrix (fraud loss vs. declined-good-customer cost vs. review cost), and high-value transactions are held to a stricter bar.
Calibration is the connective tissue: labeling, sampling/prior-correction, model objective, thresholding, and evaluation are linked by the requirement that scores be true-prior-calibrated probabilities; the candidate should show why each stage preserves that.
Train/serve consistency and honest evaluation under delayed, biased labels: point-in-time features, time-based splits, matured-only evaluation, and off-policy correction recur across labeling, architecture, and metrics; a candidate who only optimizes offline AUC misses the problem.
Operational realism and adversarial robustness: latency budgets, fallbacks, a rule layer, reversible friction, and monitoring that front-runs 90-day labels distinguish a deployable design from a notebook model.
Follow-up Questions
How would you formulate setting a per-account spending limit as a reinforcement-learning problem (state, action, reward), and why is RL a poor fit for the in-line approve/decline decision in Part 6?
Could a large language model replace or augment this traditional pipeline? Where would it plausibly help (e.g. unstructured signals, investigator assistance, narrative features) and where do latency, calibration, cost, and adversarial constraints make it a bad fit for the in-line scorer?
A new fraud ring appears that your matured training data has never seen, and it won't be confirmed for 60+ days. What in the design lets you respond now, and what are the false-positive risks of acting on early proxies?
Suppose calibration drifts in production but ranking (PR-AUC) looks stable. Which downstream decisions break, and how would you detect and remediate this without a full retrain?