PracHub

Design a fraud detection system

Last updated: Jun 25, 2026

Quick Overview

This question evaluates a candidate's ability to design end-to-end ML systems under real-world constraints, specifically targeting fraud detection at scale. It tests competency in low-latency serving, delayed label handling, class imbalance, and adversarial data drift — core ML System Design skills assessed in senior engineering interviews.

  • hard
  • Amazon
  • ML System Design
  • Software Engineer

Company: Amazon

Role: Software Engineer

Category: ML System Design

Difficulty: hard

Interview Round: Technical Screen

Design a real-time payment fraud detection system. Discuss: events and labels (chargebacks, disputes), feature store (user, device, merchant, graph features), model selection (tree ensembles, deep models, anomaly detection), rule engine + model ensemble, data pipeline and streaming inference, latency budgets and fallbacks, thresholding to balance false positives vs. fraud loss, human-in-the-loop review, concept drift and adversarial adaptation, explainability requirements, online experiments, monitoring (precision at top-K, approval rate, fraud rate), and incident response/rollback.

Aug 10, 2025, 12:00 AM

Design a Real-Time Payment Fraud Detection System

Design an ML-powered system that scores each online card-not-present (CNP) payment during authorization and decides whether to approve, decline, challenge / step-up (e.g., 3-D Secure), or route to manual review — all within a tight, synchronous latency budget. Outcome labels such as chargebacks arrive weeks later, so you must train with delayed and noisy labels while operating on fresh streaming features.

This is an end-to-end ML system design question: it spans the data foundation (events, labels, feature store), the modeling stack (rules + ML ensemble), the low-latency serving path, the decisioning policy, and the full operational lifecycle (drift, explainability, experimentation, monitoring, incident response).

Constraints & Assumptions

  • Latency: End-to-end p95 decision latency budget is 100 ms, measured from the start of feature retrieval to the emitted decision. Soft degradations (rules-only fallback) are permitted under failure.
  • Label delay: Chargebacks / disputes resolve weeks to months after the transaction (commonly a 90–180 day observation window depending on card network rules). You cannot wait for ground truth to make decisions.
  • Class imbalance & asymmetric cost: Fraud is rare (often well under 1% of transactions) and a missed fraud (chargeback loss + fees) is far more expensive than a single false decline of a good customer — but excessive false declines erode revenue and trust.
  • Adversarial environment: Fraudsters actively probe and adapt; the data distribution shifts as defenses change.
  • Scale: Assume a high-throughput processor (thousands of transactions per second at peak); the design must scale horizontally and degrade gracefully.

Clarifying Questions to Ask

A strong candidate scopes the problem before designing. Reasonable questions include:

  • What is the decision space — just approve/decline, or do we also have step-up authentication (3-D Secure) and a manual-review queue as intermediate actions?
  • What are the business objectives and guardrails: are we minimizing total cost (fraud loss plus revenue lost to false declines), holding an approval-rate floor, respecting a regulatory false-positive ceiling, or targeting a chargeback rate (e.g., to stay under card-network monitoring programs)?
  • What is the fraud base rate and the typical chargeback amount distribution? This drives thresholds and cost weighting.
  • Who bears the loss (issuer vs. acquirer vs. merchant) and does liability shift on step-up (3-D Secure)? This changes the value of challenge vs. decline.
  • What data and infrastructure already exist — an event bus, an offline warehouse, an online KV store, a model registry — and what is greenfield?
  • What are the compliance constraints — PCI-DSS, regional data residency, right-to-erasure, adverse-action / reason-code requirements?

Part 1: Events, Labels, and the Feature Store

Design the data foundation. Specify (a) which events to ingest and how they flow, (b) how to define positive/negative labels from delayed chargebacks/disputes and handle label delay without leakage, and (c) the feature store — feature categories, the offline vs. online split, train/serve consistency, TTL/freshness, backfilling, and point-in-time ("time-travel") joins for training.
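
The delayed-label handling in (b) can be sketched as a small labeling function. This is a minimal illustration, not a prescribed scheme: the 120-day window, the function name, and the three-way outcome (1, 0, None) are assumptions chosen for the example, and a real system would also separate confirmed fraud from friendly fraud.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed observation window; the real value depends on card-network rules.
OBSERVATION_WINDOW = timedelta(days=120)


def label_transaction(txn_time: datetime,
                      chargeback_time: Optional[datetime],
                      as_of: datetime) -> Optional[int]:
    """Return 1 (fraud), 0 (matured non-fraud), or None (not yet labelable).

    Only information available at `as_of` is used, so a training snapshot
    built "as of" a past date never sees chargebacks filed after that date.
    """
    if chargeback_time is not None and chargeback_time <= as_of:
        return 1  # a chargeback was already known at snapshot time
    if as_of - txn_time >= OBSERVATION_WINDOW:
        return 0  # window elapsed with no chargeback: treat as negative
    return None   # still inside the window: unlabeled, not a negative


txn = datetime(2025, 1, 1)
early_snapshot = label_transaction(txn, None, as_of=datetime(2025, 2, 1))
mature_snapshot = label_transaction(txn, None, as_of=datetime(2025, 6, 1))
```

Returning None for immature transactions keeps them out of the negative class; whether to drop them, down-weight them, or model them with a positive-unlabeled method is a separate design choice.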

What This Part Should Cover

  • A concrete event taxonomy (payment lifecycle, post-transaction outcomes, account/device signals, merchant signals, external intelligence) flowing through an ordered, replayable log.
  • A leakage-safe labeling scheme that respects the observation window, distinguishes confirmed fraud from friendly fraud and other disputes, and treats transactions still inside the window as unlabeled (a positive-unlabeled setting) rather than as confirmed negatives.
  • Feature categories spanning entities (user, device, payment instrument, merchant) plus velocity and graph/network features.
  • A dual store (offline columnar + online low-latency KV) with shared transforms, per-feature TTLs, watermarks for late/out-of-order data, backfills on schema change, and point-in-time correct training joins.
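
A point-in-time ("time-travel") join can be sketched with the standard library alone. The example below is an assumption-laden illustration: the feature name `txn_count_24h`, the integer timestamps, and the choice to treat a value stamped exactly at the event time as visible are all hypothetical. Production systems usually get this from the feature store's offline API rather than hand-written code.

```python
from bisect import bisect_right


def point_in_time_value(snapshots, event_time):
    """Return the latest feature value with timestamp <= event_time.

    `snapshots` is a list of (timestamp, value) pairs sorted by timestamp.
    Returns None when no snapshot existed yet at `event_time`.
    """
    timestamps = [ts for ts, _ in snapshots]
    idx = bisect_right(timestamps, event_time)
    return snapshots[idx - 1][1] if idx > 0 else None


# Hypothetical feature history for one card, keyed by the time each value
# became available in the store (not the time it describes).
txn_count_24h = [(100, 1), (200, 4), (300, 9)]

training_rows = [
    {"txn_id": "t1", "event_time": 150},
    {"txn_id": "t2", "event_time": 300},
    {"txn_id": "t3", "event_time": 50},
]
for row in training_rows:
    row["txn_count_24h"] = point_in_time_value(txn_count_24h, row["event_time"])
```

Row t3 receives None because no feature value existed before its event time, which is the behavior that prevents future information from leaking into training rows. If the current transaction must not count toward its own velocity feature, a strict less-than comparison (`bisect_left`) would be the safer choice.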

Part 2: Model Selection and the Rule Engine + ML Ensemble

Design the scoring stack. (a) Compare model families — gradient-boosted tree ensembles, deep models (sequence/representation/graph), and unsupervised anomaly detection for cold start — and address calibration, class imbalance, and cost-sensitive learning. (b) Design how a deterministic rule engine composes with the ML ensemble, the ensembling strategy, and how reason codes are produced.

Clarifying Questions for this Part

  • Are there hard regulatory or compliance rules (sanctions lists, mandatory step-up triggers) that must always fire regardless of the model score?
  • Do we need monotonic constraints (e.g., risk must not decrease as velocity increases) for auditability or fairness?

What This Part Should Cover

  • A layered modeling rationale: GBDT baseline, when deep/sequence/graph models add value, and anomaly detection for cold start, with the latency/maintainability trade-offs stated.
  • Calibration (isotonic/Platt, ideally per segment), imbalance handling (class weights, focal loss, balanced sampling), and cost-sensitive training tied to expected loss.
  • A rules + ML composition: deterministic safety/compliance rules and rate-limits first, then a calibrated ensemble (blending/stacking, optionally segment experts), emitting reason codes from both rule traces and feature attributions.
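
The "rules first, then the model" composition can be sketched as follows. Everything concrete here is hypothetical: the rule names, the placeholder country code "XX", the velocity limit, the score thresholds, and the `MODEL_` reason-code prefix are illustrative assumptions, and `top_features` stands in for whatever attribution method supplies the leading features.

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    action: str  # "approve", "challenge", or "decline"
    reason_codes: list = field(default_factory=list)


# Hypothetical hard rules: (reason code, predicate, forced action).
HARD_RULES = [
    ("SANCTIONED_COUNTRY", lambda t: t.get("country") in {"XX"}, "decline"),
    ("CARD_VELOCITY_LIMIT", lambda t: t.get("card_txn_count_1h", 0) > 20, "decline"),
]


def decide(txn: dict, model_score: float, top_features: list) -> Decision:
    """Apply deterministic rules first, then fall through to the model score."""
    for code, predicate, action in HARD_RULES:
        if predicate(txn):
            return Decision(action, [code])  # the rule trace is the reason
    # Illustrative thresholds on a calibrated fraud probability.
    if model_score >= 0.90:
        action = "decline"
    elif model_score >= 0.50:
        action = "challenge"
    else:
        action = "approve"
    reasons = []
    if action != "approve":
        reasons = [f"MODEL_{name.upper()}" for name in top_features]
    return Decision(action, reasons)


example = decide({"country": "US"}, 0.6, ["device_age"])
```

Because a hard rule returns before the score is consulted, a compliance rule can never be overridden by a low model score, which is the property the clarifying question about regulatory rules is probing.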

Part 3: Data Pipeline, Streaming Inference, Latency Budget, and Fallbacks

Design the runtime. (a) Specify the ingestion → stream processing → feature computation → online retrieval → low-latency inference pipeline. (b) Give a concrete latency budget breakdown within the 100 ms p95, the caching strategy, degradation/fallback paths (e.g., rules-only) when dependencies are slow or down, and idempotency of decisions.
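
One way to make the latency budget in (b) concrete is to write the per-stage allocations down and check that they fit. The stage names and millisecond values below are assumptions made up for illustration, not figures from the question. Per-stage p95 values also do not add up exactly to an end-to-end p95, so this is a planning approximation rather than a guarantee.

```python
# Hypothetical per-stage p95 allocations in milliseconds.
LATENCY_BUDGET_MS = {
    "parallel_feature_fetch": 30,
    "feature_assembly": 5,
    "rule_engine": 5,
    "model_inference": 25,
    "ensemble_and_calibration": 5,
    "decision_policy": 5,
    "audit_log_enqueue": 5,
}
TOTAL_BUDGET_MS = 100  # the end-to-end p95 budget stated in the constraints

allocated_ms = sum(LATENCY_BUDGET_MS.values())
headroom_ms = TOTAL_BUDGET_MS - allocated_ms

# Fail fast (e.g., in a config test) if someone over-allocates a stage.
assert headroom_ms >= 0, "stage allocations exceed the end-to-end budget"
```

With these example numbers the stages consume 80 ms and leave 20 ms of headroom for tail variance and retries.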

What This Part Should Cover

  • A coherent async data path (event bus → stream processors → online + offline stores + data lake) feeding a synchronous decision path.
  • A stateless, autoscaled inference service with parallel feature fetch, per-dependency timeouts, L1 cache, and a fast scoring runtime.
  • A numeric latency budget that sums to ≤100 ms p95 with headroom, plus explicit degradation tiers (stale cache → minimal features → rules-only) and idempotency keyed on transaction id.
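
Per-dependency timeouts, a rules-only fallback, and idempotency keyed on transaction id can be sketched with `asyncio`. This is a simplified illustration: the 30 ms timeout, the in-memory dictionary standing in for a durable idempotency store, and the callables passed in are all assumptions. It also collapses the degradation tiers to two (full scoring and rules-only) and ignores the race between concurrent retries of the same transaction.

```python
import asyncio

FEATURE_TIMEOUT_S = 0.03  # assumed per-dependency timeout
_decisions = {}           # in-memory stand-in for a durable idempotency store


async def fetch_with_timeout(coro, timeout_s, fallback):
    """Await one dependency; return `fallback` if it is slow or fails."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except Exception:  # includes the timeout raised by wait_for
        return fallback


async def score_transaction(txn_id, fetch_features, model, rules_only):
    if txn_id in _decisions:  # idempotent retry: replay the stored decision
        return _decisions[txn_id]
    features = await fetch_with_timeout(
        fetch_features(txn_id), FEATURE_TIMEOUT_S, None)
    if features is None:
        decision = {"action": rules_only(txn_id), "tier": "rules_only"}
    else:
        decision = {"action": model(features), "tier": "full"}
    _decisions[txn_id] = decision
    return decision


async def slow_features(txn_id):  # simulates a degraded feature store
    await asyncio.sleep(1.0)
    return {"amount": 10}


async def fast_features(txn_id):
    return {"amount": 10}


def approve_all(features):
    return "approve"


def challenge_all(txn_id):
    return "challenge"


result_slow = asyncio.run(
    score_transaction("t1", slow_features, approve_all, challenge_all))
result_fast = asyncio.run(
    score_transaction("t2", fast_features, approve_all, challenge_all))
result_retry = asyncio.run(
    score_transaction("t1", fast_features, approve_all, challenge_all))
```

The retry of "t1" returns the original rules-only decision even though the feature store has recovered, so a client that retries an authorization never sees two different answers for the same transaction.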

Part 4: Decisioning Policy and Human-in-the-Loop Review

Design how scores become actions. (a) Derive a threshold / expected-value policy that balances false positives (lost revenue, customer friction) against fraud loss, including the cost formulation and segment-specific thresholds. (b) Design the manual-review subsystem — queue tiering, sampling, SLAs, the analyst tooling, and the feedback / active-learning loop back into labels and rules.

Clarifying Questions for this Part

  • Does step-up authentication shift liability to the issuer? If so, "challenge" can be strictly preferred to "decline" for a band of medium-risk transactions.
  • What is the manual-review capacity (analysts × throughput)? This caps the review rate and forces the queue to be risk-prioritized.
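
A capacity-capped, risk-prioritized review queue can be sketched with `heapq`. The ranking key (fraud probability times amount) is an assumed approximation of expected loss that ignores fees and recoveries, and the transaction tuples are made-up examples.

```python
import heapq


def select_for_review(candidates, capacity):
    """Pick the `capacity` cases with the highest approximate expected loss.

    Each candidate is (txn_id, fraud_probability, amount).
    """
    return heapq.nlargest(capacity, candidates, key=lambda c: c[1] * c[2])


queue = select_for_review(
    [
        ("t1", 0.30, 50.0),
        ("t2", 0.10, 900.0),
        ("t3", 0.60, 20.0),
        ("t4", 0.05, 100.0),
    ],
    capacity=2,
)
review_ids = [txn_id for txn_id, _, _ in queue]
```

Note that t3 has the highest fraud probability but is not selected, because its small amount gives it a lower expected loss than t2 or t1. Ranking by probability alone would spend analyst time differently.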

What This Part Should Cover

  • A cost-sensitive, calibrated decision rule (expected-value formulation) with per-segment thresholds and hysteresis to avoid threshold flapping.
  • Correct use of the intermediate actions (challenge/step-up, manual review) given liability and capacity, not just a binary approve/decline.
  • A review pipeline: tiered routing by risk, SLAs with safe auto-timeout behavior, analyst tooling (reason codes, linked-entity/graph view, prior decisions), and a closed feedback loop (sampling for QA, active learning, verdicts → labels/rules).
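
The expected-value policy can be sketched by computing an expected cost per action and picking the minimum. All parameters below (margin rate, chargeback fee, challenge abandonment rate, the share of fraud that passes a challenge) are hypothetical defaults for illustration, and manual review is left out to keep the example short.

```python
def expected_costs(p_fraud, amount, *, margin_rate=0.05, chargeback_fee=15.0,
                   abandon_rate=0.10, fraud_pass_rate=0.20,
                   liability_shift=False):
    """Expected cost of each action for one transaction (values assumed)."""
    good = 1.0 - p_fraud
    lost_margin = amount * margin_rate     # revenue lost on a blocked good sale
    fraud_loss = amount + chargeback_fee   # loss if fraud is approved
    # If liability shifts to the issuer, fraud that passes a challenge
    # costs the merchant nothing in this simplified model.
    challenge_fraud_loss = 0.0 if liability_shift else fraud_loss
    return {
        "approve": p_fraud * fraud_loss,
        "decline": good * lost_margin,
        "challenge": good * abandon_rate * lost_margin
                     + p_fraud * fraud_pass_rate * challenge_fraud_loss,
    }


def choose_action(p_fraud, amount, **kwargs):
    costs = expected_costs(p_fraud, amount, **kwargs)
    return min(costs, key=costs.get)


low_risk = choose_action(0.001, 100.0)
medium_risk = choose_action(0.01, 100.0)
high_risk = choose_action(0.9, 100.0)
high_risk_shifted = choose_action(0.9, 100.0, liability_shift=True)
```

Under these assumed costs the policy approves at very low risk, challenges in a middle band, and declines at high risk. With liability shift enabled, challenge becomes cheaper than decline even for the high-risk case, which is why the liability question matters. The policy only makes sense if `p_fraud` is a calibrated probability.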

Part 5: Lifecycle and Operations — Drift, Explainability, Experiments, Monitoring, and Incident Response

Design the operational lifecycle. Cover: (a) concept drift & adversarial adaptation (continuous training, drift detection, defenses); (b) explainability (feature attributions, rule traces, audit logging) for support and regulators; (c) online experiments (shadow/canary/A-B, guardrail metrics, ramp policy, bias control); (d) monitoring & alerting (precision@top-K, approval rate, fraud rate, latency SLOs, data quality, feature drift); and (e) incident response & rollback (kill switches, versioned rollback, runbooks, postmortems).
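
Precision at top-K, one of the monitoring metrics named in (d), can be computed as in the sketch below. The scores and labels are made-up values. Because labels arrive late, the metric is only meaningful on transactions whose observation window has closed, or on proxy labels treated with appropriate caution.

```python
def precision_at_top_k(scores, labels, k):
    """Fraction of the k highest-scored transactions that are confirmed fraud."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not scores:
        raise ValueError("scores must be non-empty")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


example_scores = [0.9, 0.8, 0.7, 0.2, 0.1]
example_labels = [1, 0, 1, 0, 0]
p_at_3 = precision_at_top_k(example_scores, example_labels, k=3)
```

In practice K is often tied to an operational quantity such as daily manual-review capacity, so the metric answers "how much fraud is in the cases we can actually look at".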

What This Part Should Cover

  • Drift detection on leading indicators (PSI/KL on features and scores, proxy labels) plus a continuous-training cadence with a long-tail replay buffer, and concrete adversarial defenses.
  • Per-decision explainability persisted for audit: rules fired (with thresholds), feature attributions (e.g., SHAP, precomputed or attached asynchronously), model version, and a feature-vector hash — mapped to customer-facing reason codes with PII redaction.
  • A safe experimentation ladder (shadow → canary → A/B) with real-time and delayed guardrail metrics, variance reduction (e.g., CUPED/stratification), predefined ramp/stop criteria, and bias controls.
  • Monitoring of decisioning SLOs, business metrics (precision@top-K, approval rate, lagged fraud rate, manual-review rate and dispute win rate), model health (calibration, score drift, feature freshness/nulls), and data-quality checks, with tiered alerting.
  • Incident controls: kill switch to rules-only, per-segment threshold bumps, immutable versioned models with blue/green rollback, runbooks, and blameless postmortems.
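
The PSI drift check mentioned above can be sketched in a few lines. The bin edges, the sample values, and the small `eps` floor used to avoid taking the log of zero are assumptions for illustration. Both samples are assumed to be non-empty and the bin edges ascending.

```python
import math


def population_stability_index(expected, actual, bin_edges, eps=1e-6):
    """PSI between a baseline sample and a recent sample on shared bins."""
    def proportions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # Bin index = number of edges at or below the value.
            idx = sum(v >= edge for edge in bin_edges)
            counts[idx] += 1
        total = len(values)
        return [max(c / total, eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline_scores = [0.1, 0.2, 0.3, 0.4]
recent_scores = [0.6, 0.7, 0.8, 0.9]
edges = [0.25, 0.5, 0.75]

psi_same = population_stability_index(baseline_scores, baseline_scores, edges)
psi_shifted = population_stability_index(baseline_scores, recent_scores, edges)
```

A commonly cited rule of thumb treats PSI below about 0.1 as stable and above about 0.25 as a significant shift, but alert thresholds are better tuned per feature against historical variation.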

What a Strong Answer Covers

Across all parts, the interviewer is looking for an end-to-end design that holds together rather than a list of buzzwords. The strongest answers demonstrate:

  • A clean split between the synchronous decision path and the asynchronous learning path, with the two connected only through well-defined stores and an event log.
  • Train/serve consistency and leakage discipline treated as first-class concerns — shared feature definitions, point-in-time joins, and the delayed-label problem handled coherently from labeling through evaluation through drift monitoring.
  • Cost-aware, calibrated decisioning rather than a single accuracy-maximizing classifier — probabilities that mean something, asymmetric costs, segment-specific thresholds, and the right use of intermediate actions (challenge, review).
  • Operating under failure: explicit latency budgeting, graceful degradation to rules-only, idempotency, kill switches, and rollback.
  • A self-improving loop: human review and monitoring feed labels and rules; experiments gate every change; drift and adversarial pressure are anticipated, not patched after the fact.
  • Compliance, privacy, and auditability woven in (PCI-DSS, data residency, reason codes / adverse-action, audit logging) rather than bolted on.

Follow-up Questions

  • A new, large-volume fraud attack appears overnight using a pattern your model has never seen and your features don't capture. Walk through your first 30 minutes of response — detection, containment, and the path back to a model fix.
  • Your false-decline rate just spiked and customer complaints are rising, but the lagged fraud rate looks normal. How do you diagnose whether this is a model regression, a feature-freshness/data-quality bug, train/serve skew, or a legitimate distribution shift — and how do you mitigate while you investigate?
  • Suppose chargeback labels effectively take 120 days to mature. How would you shorten the feedback loop to retrain meaningfully faster without trusting noisy early signals as ground truth?
  • A regulator (or a declined customer) demands an explanation for a specific decline. What exactly can you produce, how do you guarantee it reflects the model version and features that actually ran, and what are the limits of post-hoc attributions like SHAP?
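
For the last follow-up, one ingredient of a verifiable explanation is an audit record written at decision time that pins the model version and a hash of the exact feature vector. The sketch below is a hypothetical illustration: the record fields, the version string, and the canonical-JSON hashing scheme are assumptions, and a real record would also need PII redaction and tamper-resistant storage.

```python
import hashlib
import json


def feature_vector_hash(features: dict) -> str:
    """Deterministic SHA-256 over the features, independent of key order."""
    canonical = json.dumps(features, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_audit_record(txn_id, model_version, features, rules_fired,
                       score, action):
    """Assemble the per-decision record to persist alongside the decision."""
    return {
        "txn_id": txn_id,
        "model_version": model_version,
        "feature_hash": feature_vector_hash(features),
        "rules_fired": list(rules_fired),
        "score": score,
        "action": action,
    }


record = build_audit_record(
    txn_id="t1",
    model_version="fraud-gbdt-v42",  # hypothetical version identifier
    features={"amount": 120.0, "card_txn_count_1h": 3},
    rules_fired=[],
    score=0.07,
    action="approve",
)
```

Recomputing the hash from the logged feature values and comparing it with `feature_hash` gives a check that an explanation produced later was generated from the same inputs the model actually saw.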
