This question evaluates a candidate's ability to design end-to-end ML systems under real-world constraints, specifically targeting fraud detection at scale. It tests competency in low-latency serving, delayed label handling, class imbalance, and adversarial data drift — core ML System Design skills assessed in senior engineering interviews.
Design a real-time payment fraud detection system. Discuss: events and labels (chargebacks, disputes), feature store (user, device, merchant, graph features), model selection (tree ensembles, deep models, anomaly detection), rule engine + model ensemble, data pipeline and streaming inference, latency budgets and fallbacks, thresholding to balance false positives vs. fraud loss, human-in-the-loop review, concept drift and adversarial adaptation, explainability requirements, online experiments, monitoring (precision at top-K, approval rate, fraud rate), and incident response/rollback.
Hard | Software Engineer | Technical Screen | ML System Design
Design a Real-Time Payment Fraud Detection System
Design an ML-powered system that scores each online card-not-present (CNP) payment during authorization and decides whether to approve, decline, challenge / step-up (e.g., 3-D Secure), or route to manual review — all within a tight, synchronous latency budget. Outcome labels such as chargebacks arrive weeks later, so you must train with delayed and noisy labels while operating on fresh streaming features.
This is an end-to-end ML system design question: it spans the data foundation (events, labels, feature store), the modeling stack (rules + ML ensemble), the low-latency serving path, the decisioning policy, and the full operational lifecycle (drift, explainability, experimentation, monitoring, incident response).
Constraints & Assumptions
Latency: The end-to-end p95 decision latency budget is 100 ms, measured from the start of feature retrieval to the emitted decision. Soft degradations (e.g., a rules-only fallback) are permitted under failure.
Label delay: Chargebacks and disputes resolve weeks to months after the transaction (commonly a 90–180 day observation window, depending on card-network rules). You cannot wait for ground truth to make decisions.
Class imbalance & asymmetric cost: Fraud is rare (often well under 1% of transactions), and a missed fraud (chargeback loss plus fees) is far more expensive than a single false decline of a good customer. Excessive false declines, however, erode revenue and trust.
Adversarial environment: Fraudsters actively probe and adapt; the data distribution shifts as defenses change.
Scale: Assume a high-throughput processor (thousands of transactions per second at peak); the design must scale horizontally and degrade gracefully.
Clarifying Questions to Ask
A strong candidate scopes the problem before designing. Reasonable questions include:
What is the decision space: just approve/decline, or do we also have step-up authentication (3-D Secure) and a manual-review queue as intermediate actions?
What are the business objectives and guardrails: are we optimizing fraud-loss-minus-revenue, an approval-rate floor, a regulatory false-positive ceiling, or a target chargeback rate (e.g., to stay under card-network monitoring programs)?
What are the fraud base rate and the typical chargeback amount distribution? These drive thresholds and cost weighting.
Who bears the loss (issuer vs. acquirer vs. merchant), and does liability shift on step-up (3-D Secure)? This changes the value of challenge vs. decline.
What data and infrastructure already exist (an event bus, an offline warehouse, an online KV store, a model registry), and what is greenfield?
What are the compliance constraints: PCI-DSS, regional data residency, right-to-erasure, adverse-action / reason-code requirements?
Part 1: Events, Labels, and the Feature Store
Design the data foundation. Specify (a) which events to ingest and how they flow, (b) how to define positive/negative labels from delayed chargebacks/disputes and handle label delay without leakage, and (c) the feature store — feature categories, the offline vs. online split, train/serve consistency, TTL/freshness, backfilling, and point-in-time ("time-travel") joins for training.
What This Part Should Cover
A concrete event taxonomy (payment lifecycle, post-transaction outcomes, account/device signals, merchant signals, external intelligence) flowing through an ordered, replayable log.
A leakage-safe labeling scheme that respects the observation window, distinguishes confirmed fraud from friendly fraud/disputes, and treats in-window transactions as positive-unlabeled.
Feature categories spanning entities (user, device, payment instrument, merchant) plus velocity and graph/network features.
A dual store (offline columnar + online low-latency KV) with shared transforms, per-feature TTLs, watermarks for late/out-of-order data, backfills on schema change, and point-in-time-correct training joins.
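The labeling and point-in-time requirements above can be illustrated with a minimal sketch. It assumes a 120-day maturity window (`MATURITY`) and hypothetical helper names (`label_as_of`, `point_in_time_value`); it does not separate confirmed fraud from friendly fraud, which a real scheme would need to do.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

# Assumed observation window; the real value depends on card-network rules.
MATURITY = timedelta(days=120)


def label_as_of(txn_time, chargeback_time, as_of):
    """Return 1 (fraud), 0 (matured negative) or None (still unlabeled).

    Only information available at the training cutoff `as_of` is used.
    """
    if chargeback_time is not None and chargeback_time <= as_of:
        return 1  # chargeback already observed by the cutoff
    if as_of - txn_time >= MATURITY:
        return 0  # window closed with no chargeback seen
    return None  # still inside the window: positive-unlabeled, not negative


def point_in_time_value(history, ts):
    """Latest feature value recorded strictly before `ts`.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    times = [t for t, _ in history]
    idx = bisect_left(times, ts)  # every entry before idx has timestamp < ts
    return history[idx - 1][1] if idx > 0 else None


t0 = datetime(2024, 1, 1)
history = [(t0, 1), (t0 + timedelta(hours=1), 2), (t0 + timedelta(hours=5), 9)]
# A transaction at t0 + 2h sees the value written at t0 + 1h, never the later one.
feature = point_in_time_value(history, t0 + timedelta(hours=2))
```

The strict "before" comparison is a deliberate choice: a feature update stamped at the same instant as the transaction may already include that transaction, so excluding it errs on the side of no leakage.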
Part 2: Model Selection and the Rule Engine + ML Ensemble
Design the scoring stack. (a) Compare model families — gradient-boosted tree ensembles, deep models (sequence/representation/graph), and unsupervised anomaly detection for cold start — and address calibration, class imbalance, and cost-sensitive learning. (b) Design how a deterministic rule engine composes with the ML ensemble, the ensembling strategy, and how reason codes are produced.
Clarifying Questions for this Part
Are there hard regulatory or compliance rules (sanctions lists, mandatory step-up triggers) that must always fire regardless of the model score?
Do we need monotonic constraints (e.g., risk must not decrease as velocity increases) for auditability or fairness?
What This Part Should Cover
A layered modeling rationale: a GBDT baseline, when deep/sequence/graph models add value, and anomaly detection for cold start, with the latency/maintainability trade-offs stated.
Calibration (isotonic/Platt, ideally per segment), imbalance handling (class weights, focal loss, balanced sampling), and cost-sensitive training tied to expected loss.
A rules + ML composition: deterministic safety/compliance rules and rate limits first, then a calibrated ensemble (blending/stacking, optionally segment experts), emitting reason codes from both rule traces and feature attributions.
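One possible shape for the rules-then-ensemble composition is sketched below. The rule names, thresholds, blend weights, and the `M_` reason-code prefix are all illustrative assumptions, and the model scores are assumed to be calibrated probabilities already. For simplicity every hard rule declines; a real system might map some rules to step-up instead.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Decision:
    action: str                      # "approve", "challenge" or "decline"
    score: Optional[float] = None    # None when a hard rule short-circuits
    reason_codes: List[str] = field(default_factory=list)


# (reason code, predicate) pairs; names and thresholds are hypothetical.
HARD_RULES = [
    ("R_SANCTIONED_COUNTRY", lambda t: t["country"] in {"XX"}),
    ("R_CARD_VELOCITY", lambda t: t["card_txn_count_1h"] > 20),
]


def blend(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of calibrated model probabilities."""
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total


def decide(txn, model_scores, attributions, weights,
           decline_at=0.9, challenge_at=0.5) -> Decision:
    # 1. Deterministic rules run first and always win over the model.
    fired = [code for code, predicate in HARD_RULES if predicate(txn)]
    if fired:
        return Decision("decline", None, fired)
    # 2. Otherwise blend the models and threshold the blended score.
    score = blend(model_scores, weights)
    if score < challenge_at:
        return Decision("approve", score, [])
    # 3. Reason codes come from the two largest absolute attributions.
    top = sorted(attributions, key=lambda k: abs(attributions[k]), reverse=True)[:2]
    reasons = [f"M_{name.upper()}" for name in top]
    return Decision("decline" if score >= decline_at else "challenge", score, reasons)


txn = {"country": "US", "card_txn_count_1h": 3}
result = decide(
    txn,
    model_scores={"gbdt": 0.8, "seq": 0.6},
    attributions={"card_txn_count_1h": 0.3, "amount": -0.1, "device_age_days": 0.5},
    weights={"gbdt": 0.75, "seq": 0.25},
)
```

Keeping the rule trace and the attribution-derived codes in the same `reason_codes` field means the audit record has one place to look, whichever layer made the call.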
Part 3: Data Pipeline, Streaming Inference, Latency Budget, and Fallbacks
Design the runtime. (a) Specify the ingestion → stream processing → feature computation → online retrieval → low-latency inference pipeline. (b) Give a concrete latency budget breakdown within the 100 ms p95, the caching strategy, degradation/fallback paths (e.g., rules-only) when dependencies are slow or down, and idempotency of decisions.
What This Part Should Cover
A coherent async data path (event bus → stream processors → online + offline stores + data lake) feeding a synchronous decision path.
A stateless, autoscaled inference service with parallel feature fetch, per-dependency timeouts, an L1 cache, and a fast scoring runtime.
A numeric latency budget that sums to ≤100 ms p95 with headroom, plus explicit degradation tiers (stale cache → minimal features → rules-only) and idempotency keyed on transaction ID.
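As a rough illustration of the budget, timeouts, degradation, and idempotency points above, here is a sketch using `asyncio`. The stage names and millisecond figures in `BUDGET_MS` are assumptions rather than measured numbers, the in-memory `_decisions` dict stands in for a durable idempotency store, and the stale-cache tier is omitted to keep the example short.

```python
import asyncio

# Illustrative p95 budget in milliseconds; stages and numbers are assumptions.
BUDGET_MS = {
    "network_and_auth": 10,
    "feature_fetch": 35,
    "rules": 5,
    "model_scoring": 25,
    "policy_and_logging": 10,
}
assert sum(BUDGET_MS.values()) <= 100  # 85 ms, leaving 15 ms of headroom

_decisions = {}  # stand-in for a durable store keyed on transaction id


async def fetch_with_timeout(fetch, timeout_s):
    """Run one feature fetch; return None instead of raising on timeout."""
    try:
        return await asyncio.wait_for(fetch(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None


async def decide(txn_id, fetchers, score_full, score_minimal, rules_only,
                 timeout_s=0.035):
    if txn_id in _decisions:  # retried request: replay the stored decision
        return _decisions[txn_id]
    names = list(fetchers)
    # Fetch all feature groups in parallel, each with its own timeout.
    results = await asyncio.gather(
        *(fetch_with_timeout(fetchers[n], timeout_s) for n in names)
    )
    features = {n: r for n, r in zip(names, results) if r is not None}
    if len(features) == len(names):
        decision = ("full_model", score_full(features))
    elif features:
        decision = ("minimal_features", score_minimal(features))
    else:
        decision = ("rules_only", rules_only())
    _decisions[txn_id] = decision
    return decision


async def fast_fetch():
    return {"txn_count_1h": 2}


async def slow_fetch():
    await asyncio.sleep(0.5)  # simulates a dependency that blows its timeout
    return {"device_age_days": 40}


outcome = asyncio.run(decide(
    "txn-1", {"user": fast_fetch, "device": slow_fetch},
    score_full=lambda f: 0.1, score_minimal=lambda f: 0.2,
    rules_only=lambda: "approve", timeout_s=0.05,
))
```

Because the decision is stored before it is returned, a client retry with the same transaction id receives the original decision even if dependencies have since recovered, which avoids approving and declining the same payment.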
Part 4: Decisioning Policy and Human-in-the-Loop Review
Design how scores become actions. (a) Derive a threshold / expected-value policy that balances false positives (lost revenue, customer friction) against fraud loss, including the cost formulation and segment-specific thresholds. (b) Design the manual-review subsystem — queue tiering, sampling, SLAs, the analyst tooling, and the feedback / active-learning loop back into labels and rules.
Clarifying Questions for this Part
Does step-up authentication shift liability to the issuer? If so, "challenge" can be strictly preferred to "decline" for a band of medium-risk transactions.
What is the manual-review capacity (analysts × throughput)? This caps the review rate and forces the queue to be risk-prioritized.
What This Part Should Cover
A cost-sensitive, calibrated decision rule (expected-value formulation) with per-segment thresholds and hysteresis to avoid threshold flapping.
Correct use of the intermediate actions (challenge/step-up, manual review) given liability and capacity, not just a binary approve/decline.
A review pipeline: tiered routing by risk, SLAs with safe auto-timeout behavior, analyst tooling (reason codes, linked-entity/graph view, prior decisions), and a closed feedback loop (sampling for QA, active learning, verdicts → labels/rules).
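The expected-value formulation can be made concrete with a small sketch that picks the action with the lowest expected cost. Every cost parameter here (`fee`, `margin_rate`, `goodwill`, `challenge_abandon`, `challenge_pass_fraud`) is a made-up placeholder, `p_fraud` is assumed to be a calibrated probability, and manual review is left out because it is constrained by capacity rather than by cost alone.

```python
def expected_costs(p_fraud, amount, *, fee=15.0, margin_rate=0.03,
                   goodwill=5.0, challenge_abandon=0.10,
                   challenge_pass_fraud=0.20):
    """Expected cost of each action for one transaction.

    All default values are hypothetical and would come from business data.
    """
    p_good = 1.0 - p_fraud
    margin = amount * margin_rate      # revenue lost if a good customer leaves
    fraud_loss = amount + fee          # chargeback amount plus fees
    return {
        # Approving costs the full fraud loss whenever the payment is fraud.
        "approve": p_fraud * fraud_loss,
        # Challenging loses some good customers and still lets some fraud pass.
        "challenge": p_good * challenge_abandon * margin
                     + p_fraud * challenge_pass_fraud * fraud_loss,
        # Declining costs margin plus goodwill whenever the customer was good.
        "decline": p_good * (margin + goodwill),
    }


def choose_action(p_fraud, amount, **costs):
    """Pick the action with the lowest expected cost."""
    table = expected_costs(p_fraud, amount, **costs)
    return min(table, key=table.get)


actions = [choose_action(p, 100.0) for p in (0.001, 0.05, 0.9)]
```

Under these placeholder costs the three probabilities map to approve, challenge, and decline respectively, which shows the medium-risk band where step-up beats both extremes. Segment-specific thresholds fall out of the same function by passing different cost parameters per segment.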
Part 5: Lifecycle and Operations — Drift, Explainability, Experiments, Monitoring, and Incident Response
Drift detection on leading indicators (PSI/KL on features and scores, proxy labels) plus a continuous-training cadence with a long-tail replay buffer, and concrete adversarial defenses.
Per-decision explainability persisted for audit: rules fired (with thresholds), feature attributions (e.g., SHAP, precomputed or attached asynchronously), model version, and a feature-vector hash, mapped to customer-facing reason codes with PII redaction.
A safe experimentation ladder (shadow → canary → A/B) with real-time and delayed guardrail metrics, variance reduction (e.g., CUPED/stratification), predefined ramp/stop criteria, and bias controls.
Monitoring of decisioning SLOs, business metrics (precision@top-K, approval rate, lagged fraud rate, review/win rate), model health (calibration, score drift, feature freshness/nulls), and data-quality checks, with tiered alerting.
Incident controls: kill switch to rules-only, per-segment threshold bumps, immutable versioned models with blue/green rollback, runbooks, and blameless postmortems.
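For the PSI-based drift check mentioned above, a minimal stdlib sketch follows. The bin edges and the `eps` floor are assumptions; the often-quoted alert levels of roughly 0.1 (watch) and 0.25 (act) are rules of thumb, not part of the definition.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples over fixed bin edges.

    `edges` are ascending interior cut points, giving len(edges) + 1 bins.
    Both samples must be non-empty.
    """
    def fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges the value has reached or passed.
            counts[sum(v >= e for e in edges)] += 1
        n = len(values)
        # Floor empty bins at eps so the logarithm stays finite.
        return [max(c / n, eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]       # training-time score sample
shifted = [0.5 + i / 200 for i in range(100)]  # live scores pushed upward
edges = [0.25, 0.5, 0.75]

no_drift = psi(baseline, baseline, edges)
drift = psi(baseline, shifted, edges)
```

Because PSI needs no labels, it can run on features and scores within minutes of a shift, long before chargebacks mature; that is what makes it a leading indicator in this setting.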
What a Strong Answer Covers
Across all parts, the interviewer is looking for an end-to-end design that holds together rather than a list of buzzwords. The strongest answers demonstrate:
A clean split between the synchronous decision path and the asynchronous learning path, with the two connected only through well-defined stores and an event log.
Train/serve consistency and leakage discipline treated as first-class concerns: shared feature definitions, point-in-time joins, and the delayed-label problem handled coherently from labeling through evaluation through drift monitoring.
Cost-aware, calibrated decisioning rather than a single accuracy-maximizing classifier: probabilities that mean something, asymmetric costs, segment-specific thresholds, and the right use of intermediate actions (challenge, review).
Operating under failure: explicit latency budgeting, graceful degradation to rules-only, idempotency, kill switches, and rollback.
A self-improving loop: human review and monitoring feed labels and rules; experiments gate every change; drift and adversarial pressure are anticipated, not patched after the fact.
Compliance, privacy, and auditability woven in (PCI-DSS, data residency, reason codes / adverse-action, audit logging) rather than bolted on.
Follow-up Questions
A new, large-volume fraud attack appears overnight using a pattern your model has never seen and your features don't capture. Walk through your first 30 minutes of response: detection, containment, and the path back to a model fix.
Your false-decline rate just spiked and customer complaints are rising, but the lagged fraud rate looks normal. How do you diagnose whether this is a model regression, a feature-freshness/data-quality bug, train/serve skew, or a legitimate distribution shift, and how do you mitigate while you investigate?
Suppose chargeback labels effectively take 120 days to mature. How would you shorten the feedback loop to retrain meaningfully faster without trusting noisy early signals as ground truth?
A regulator (or a declined customer) demands an explanation for a specific decline. What exactly can you produce, how do you guarantee it reflects the model version and features that actually ran, and what are the limits of post-hoc attributions like SHAP?