System Design: Real-Time Payment Fraud Detection
Context
Design a real-time fraud detection system for online payments (card-not-present). The system must score each transaction during authorization and decide whether to approve, decline, or route to manual review within a tight latency budget.
Assume:

- End-to-end p95 decision latency budget: 100 ms (from feature retrieval to decision), with soft degradations permitted.
- Labels (e.g., chargebacks) arrive with delays of weeks. You must train with delayed/noisy labels and operate with streaming features.
Requirements
Discuss and propose designs for:
- Events and Labels
  - What events to ingest (e.g., authorizations, captures, refunds, chargebacks, disputes, user actions).
  - How to define positive/negative labels (chargebacks, disputes) and handle label delay.
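One way to handle label delay is a maturation window: a transaction is labeled fraud once a chargeback arrives, labeled legitimate only after the window has fully elapsed with no chargeback, and excluded from training while still immature. A minimal sketch, assuming a 90-day maturation horizon (an illustrative choice, not a figure from the prompt):

```python
from datetime import datetime, timedelta

# Illustrative maturation window; real horizons depend on the card network's
# chargeback time limits.
MATURATION = timedelta(days=90)

def label_transaction(txn_time, chargeback_time, now):
    """Return 1 (fraud), 0 (legit), or None (immature, exclude from training)."""
    if chargeback_time is not None:
        return 1                      # chargeback observed -> positive
    if now - txn_time >= MATURATION:
        return 0                      # window elapsed with no chargeback -> negative
    return None                       # still inside the window -> not yet labelable
```

Excluding immature transactions (rather than labeling them negative) avoids systematically mislabeling recent fraud that simply has not charged back yet.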
- Feature Store
  - Feature categories (user, device, merchant, payment instrument, velocity, graph/network features).
  - Offline vs. online stores, consistency, TTL, backfilling, and time-travel for training.
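Velocity features illustrate the TTL requirement: the online store must answer "how many transactions on this card in the last hour" and evict stale events. A minimal in-process sketch (a production online store such as Redis with key TTLs would replace this; the window size is an illustrative assumption):

```python
from collections import deque

class VelocityCounter:
    """Sliding-window event counter per key (e.g., card id, device id)."""

    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.events = {}  # key -> deque of event timestamps (seconds)

    def record(self, key, ts):
        self.events.setdefault(key, deque()).append(ts)

    def count(self, key, now):
        q = self.events.get(key, deque())
        while q and now - q[0] > self.window:
            q.popleft()               # evict events older than the window (TTL)
        return len(q)
```

The same windowed aggregation must be reproducible offline (time-travel) so training features match what the online store would have served at decision time.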
- Model Selection
  - Compare tree ensembles, deep models (e.g., sequence or representation models), and anomaly detection for cold start.
  - Calibration, class imbalance handling, and cost-sensitive learning.
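Cost-sensitive learning can be implemented via per-example sample weights passed to the learner, weighting each example by the dollar cost of misclassifying it rather than by raw class frequency. A sketch with illustrative (assumed) cost figures:

```python
def sample_weight(is_fraud, amount, decline_cost_rate=0.02):
    """Weight a training example by its misclassification cost in dollars.

    Missing a fraud (false negative) costs roughly the transaction amount
    (chargeback plus fees); declining a good customer (false positive) costs
    the lost margin. The 2% margin rate is an illustrative assumption.
    """
    if is_fraud:
        return amount                       # cost of a false negative
    return amount * decline_cost_rate       # cost of a false positive
```

Because weighting distorts predicted probabilities, a post-hoc calibration step (e.g., Platt scaling or isotonic regression on a held-out set) is typically needed before the scores are used in expected-value thresholding.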
- Rule Engine + Model Ensemble
  - Combining deterministic rules with ML scores, ensembling strategies, and reason codes.
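A common combination pattern: hard rules short-circuit the model entirely, the model score drives the remaining decisions, and every decision carries reason codes for downstream review and audit. A sketch with assumed rule names and thresholds (the 0.9/0.5 cutoffs and $10,000 limit are illustrative, not from the prompt):

```python
def decide(txn, model_score, decline_at=0.9, review_at=0.5):
    """Return (decision, reason_codes) combining rules with a model score."""
    # Deterministic rules evaluated first: cheap, auditable, short-circuiting.
    if txn.get("card_on_blocklist"):
        return "decline", ["RULE_BLOCKLIST"]
    if txn.get("amount", 0) > 10_000:
        return "review", ["RULE_HIGH_AMOUNT"]
    # Model score drives the remaining band.
    if model_score >= decline_at:
        return "decline", ["MODEL_HIGH_RISK"]
    if model_score >= review_at:
        return "review", ["MODEL_MEDIUM_RISK"]
    return "approve", []
```

Keeping rules first also gives a natural rules-only degradation path when the model is unavailable.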
- Data Pipeline and Streaming Inference
  - Ingestion, stream processing, feature computation, online retrieval, and a low-latency inference service.
- Latency Budgets and Fallbacks
  - Budget breakdown, caching, degradation paths (e.g., rules-only), and idempotency.
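A degradation path can be enforced with a hard deadline around model scoring: if the model misses its slice of the budget, fall back to a rules-only score so the authorization never blocks past the SLO. A sketch assuming a 50 ms model slice of the 100 ms end-to-end budget (the split is an assumption):

```python
import concurrent.futures
import time

def score_with_fallback(model_fn, rules_fn, features, timeout_s=0.05):
    """Score with the model under a deadline; degrade to rules on timeout."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(model_fn, features)
    try:
        result = (future.result(timeout=timeout_s), "model")
    except concurrent.futures.TimeoutError:
        result = (rules_fn(features), "rules_fallback")
    pool.shutdown(wait=False)   # do not block the request on the stragglers
    return result

# Demo scorers (illustrative): a model within budget, one blowing it, and a
# conservative rules-only score.
def fast_model(features):
    return 0.30

def slow_model(features):
    time.sleep(0.2)
    return 0.30

def rules_only(features):
    return 0.90
```

Logging which path produced each decision ("model" vs. "rules_fallback") is what makes the degradation rate monitorable as an SLO.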
- Thresholding and Trade-offs
  - How to set thresholds to balance false positives vs. fraud loss; expected value formulation.
- Human-in-the-Loop Review
  - Review queue design, sampling strategies, SLAs, active learning, and feedback loops.
- Concept Drift and Adversarial Adaptation
  - Continuous training, drift detection, canaries, and defenses.
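A common drift signal is the Population Stability Index (PSI) between a reference and a current distribution of scores or features, binned into buckets. A sketch (the conventional rule of thumb treats PSI > 0.2 as significant shift; the smoothing epsilon is an implementation choice):

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions."""
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)   # clamp to avoid log(0)
        a_pct = max(a / a_total, eps)
        total += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return total
```

Running PSI per feature (not just on the final score) helps localize whether drift comes from the population or from an upstream pipeline break.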
- Explainability Requirements
  - Feature attributions, rule traces, and audit logging.
- Online Experiments
  - A/B/shadow testing, guardrail metrics, ramp policy, and bias control.
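Ramp policy usually relies on deterministic bucketing: hash a stable unit so the same entity lands in the same arm on every retry and at every ramp stage. A sketch assuming the card id as the bucketing unit and a salted hash (both are illustrative choices):

```python
import hashlib

def assign_arm(unit_id, ramp=0.05, salt="exp_v2"):
    """Deterministically route a unit to treatment with probability ~ramp."""
    h = hashlib.sha256(f"{salt}:{unit_id}".encode()).hexdigest()
    bucket = int(h[:8], 16) / 0xFFFFFFFF   # uniform-ish value in [0, 1]
    return "treatment" if bucket < ramp else "control"
```

Changing the salt per experiment prevents carry-over bias, and raising `ramp` only grows the treatment group without reshuffling existing assignments.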
- Monitoring and Alerting
  - Precision at top-K, approval rate, fraud rate, latency SLOs, data quality, and feature drift.
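Precision at top-K asks: of the K highest-risk transactions in a period, what fraction were actually fraud. It tracks ranking quality at the operating point reviewers actually see, independent of the score's calibration. A minimal sketch (K would typically be tied to review capacity, an assumption here):

```python
def precision_at_k(scores, labels, k):
    """Fraction of the k top-scored transactions that are labeled fraud."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    top = ranked[:k]
    return sum(label for _, label in top) / k
```

Because labels mature late, this metric is computed retrospectively over matured cohorts rather than in real time.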
- Incident Response and Rollback
  - Kill switches, model/version rollback, runbooks, and postmortems.