Design a Payment Fraud Detection Service
Company: PayPal
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Onsite
Design a real-time **fraud detection service** for a payment platform. When a user submits a payment attempt, the platform calls your service *before* authorizing or capturing the charge, and your service must return a decision: **allow**, **deny**, **challenge** (e.g. step-up authentication), or **send to manual review**.
The service should reason over signals such as transaction amount, merchant, user account history, device fingerprint, IP address, geolocation, payment instrument, velocity signals (how often a card/device/account has been used recently), chargeback history, and known fraud patterns. It must combine **deterministic rules** (written and owned by risk analysts) with **machine-learning model scores**, do so under tight latency, and remain explainable, auditable, and highly available. Risk analysts must be able to author, test, and deploy new rules and models safely, and confirmed-fraud / chargeback / manual-review outcomes must flow back as labels to improve the system over time.
```hint Where to start
Treat this as an online, synchronous scoring service sitting on the payment critical path. Separate the three things that must coexist: a **rules engine** (deterministic, analyst-owned), a **model serving** path (probabilistic score), and a **decision engine** that fuses both with business policy into one of the four actions.
```
```hint Latency and the read path
The expensive part is fetching fresh aggregates ("# attempts by this card in the last 10 min"). Pre-compute these with a stream processor into a low-latency **online feature store** so the request path is point lookups, not on-the-fly aggregation. Put strict timeouts on every dependency.
```
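As a rough illustration of the pre-computed aggregate described in this hint, the sketch below keeps a sliding-window count per key in memory. The class name `VelocityCounter`, the key format, and the in-process storage are all assumptions made for illustration; a production system would typically hold this state in a stream processor and an external low-latency store, and the sketch also assumes events arrive in timestamp order.

```python
from collections import defaultdict, deque


class VelocityCounter:
    """Sliding-window event counter per key (hypothetical in-memory sketch)."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # key -> timestamps, oldest first

    def _evict(self, key: str, now: float) -> None:
        # Drop timestamps that have fallen out of the window ending at `now`.
        q = self._events[key]
        while q and q[0] <= now - self.window_seconds:
            q.popleft()

    def record(self, key: str, ts: float) -> None:
        """Write path: called by the stream processor for each payment attempt."""
        self._events[key].append(ts)
        self._evict(key, ts)

    def count(self, key: str, now: float) -> int:
        """Read path: a point lookup made by the risk service at decision time."""
        self._evict(key, now)
        return len(self._events[key])


# Usage: "attempts by this card in the last 10 minutes".
counter = VelocityCounter(window_seconds=600)
for ts in (0, 100, 500):
    counter.record("card:tok_1", ts)
attempts_now = counter.count("card:tok_1", now=550)
```

The point of the split is that the request path only ever calls `count`, which does a bounded amount of work, while all writes happen off the critical path.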
```hint Closing the loop
Fraud labels arrive *days to weeks* late (chargebacks, disputes). Design the offline feedback pipeline and a **feature snapshot** stored at decision time so you can train on exactly the features the model saw and explain any past decision.
```
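One way to picture the feature snapshot and the delayed-label join from this hint is sketched below. The record fields, the function names, and the 60-day maturity window are assumptions chosen for illustration, not a prescribed schema. The sketch also ignores the fact that denied transactions never produce a chargeback label.

```python
import json
from dataclasses import asdict, dataclass

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class DecisionRecord:
    decision_id: str
    decided_at: float      # epoch seconds
    action: str            # allow / deny / challenge / review
    model_version: str
    features: dict         # exact feature values the model saw


def serialize(record: DecisionRecord) -> str:
    """Stable JSON form suitable for an append-only audit log."""
    return json.dumps(asdict(record), sort_keys=True)


def build_training_rows(records, fraud_ids, now, maturity_days=60):
    """Join snapshots to late labels.

    A decision becomes a negative example only after `maturity_days`
    (an assumed window) have passed without a fraud label arriving.
    """
    rows = []
    for r in records:
        if r.decision_id in fraud_ids:
            rows.append((r.features, 1))
        elif now - r.decided_at >= maturity_days * SECONDS_PER_DAY:
            rows.append((r.features, 0))
        # Otherwise the label is still immature, so the row is skipped.
    return rows
```

Training from the stored `features` dictionary, rather than recomputing features later, is what keeps the training data identical to what the model saw at decision time.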
```hint Failure behavior
Decide fail-open vs. fail-closed *per risk tier*, not globally — a degraded model or feature store should not block all payments, but it also should not wave through high-value suspicious ones.
```
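A minimal sketch of a tiered degraded-mode policy follows, assuming the model score is unavailable (for example, the serving tier timed out) but deterministic rules can still run. The threshold value, the reason-code strings, and the function name `degraded_decision` are hypothetical.

```python
from typing import Tuple

HIGH_VALUE_THRESHOLD = 1_000.00  # assumed boundary between risk tiers


def degraded_decision(amount: float, hard_rule_hit: bool) -> Tuple[str, str]:
    """Return (action, reason_code) when no model score is available."""
    if hard_rule_hit:
        # Deterministic rules still apply during a model outage.
        return "deny", "HARD_RULE_MATCH"
    if amount >= HIGH_VALUE_THRESHOLD:
        # Fail closed (softly) for the high-value tier: step up instead of allowing.
        return "challenge", "MODEL_UNAVAILABLE_HIGH_VALUE"
    # Fail open for the low-value tier so the outage does not block all payments.
    return "allow", "MODEL_UNAVAILABLE_LOW_VALUE"


low = degraded_decision(5.00, hard_rule_hit=False)
high = degraded_decision(5_000.00, hard_rule_hit=False)
```

Under these assumptions the low-value payment is allowed and the high-value payment is challenged, and both carry a reason code that records the degraded state for later audit.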
### Constraints & Assumptions
- Online decisioning is on the payment critical path; target a **p99 latency of roughly 100–200 ms** for the synchronous risk check.
- High request volume (tens of thousands of payment attempts per second at peak for a large platform); assume bursty traffic.
- **High availability** is required — fraud-service downtime stalls payment processing.
- Fraud labels are **delayed and noisy** (chargebacks can land 30+ days after the transaction; not all fraud is ever labeled).
- The service must be **explainable**: every deny / review must carry reason codes for customer support, compliance, and dispute handling.
- Rule and model changes must roll out **safely** (no big-bang deploys to a live money-movement path).
- Sensitive payment data (PAN, etc.) must be tokenized / minimized; treat PCI and privacy obligations as hard constraints.
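The safe-rollout constraint above can be made concrete with a shadow (dry-run) evaluation: replay a candidate rule against recent live decisions and measure how much currently allowed traffic it would newly block. This is only a sketch; the function names, the report fields, and the 1% promotion threshold are assumptions.

```python
def shadow_evaluate(candidate, recent_decisions):
    """Replay a candidate rule without affecting live decisions.

    candidate: predicate taking a feature dict and returning True on a hit.
    recent_decisions: list of (features, live_action) pairs from the audit log.
    """
    total = len(recent_decisions)
    hits = [(f, a) for f, a in recent_decisions if candidate(f)]
    # Hits on traffic that was allowed live are the potential false positives.
    newly_blocked = sum(1 for _, action in hits if action == "allow")
    return {
        "total": total,
        "hit_rate": len(hits) / total if total else 0.0,
        "newly_blocked_rate": newly_blocked / total if total else 0.0,
    }


def safe_to_promote(report, max_newly_blocked_rate=0.01):
    """Gate promotion on an assumed ceiling for newly blocked traffic."""
    return report["newly_blocked_rate"] <= max_newly_blocked_rate
```

A gate like this gives an analyst a quick numerical check before a rule goes live, which matters most when rules are being shipped under time pressure.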
### Clarifying Questions to Ask
- What is the **call pattern** — is the risk check synchronous and blocking before authorization, or can some decisions be made asynchronously (e.g. post-auth holds)?
- What are the **business priorities** for the decision tradeoff — minimize fraud loss, maximize approval/conversion, or a target chargeback rate? This sets the thresholds.
- What is the expected **QPS and latency budget**, and is it uniform globally or does it vary by region?
- Who **owns rules** and how fast must they ship (e.g. an analyst reacting to a live fraud attack in minutes)?
- What **labels and feedback** are available (chargebacks, disputes, manual-review outcomes, confirmed fraud) and with what delay?
- Are there **compliance / regulatory** requirements (PCI-DSS, sanctions screening, regional data residency) that constrain storage and the decision flow?
- Is there an existing **feature platform / model serving** infra to reuse, or is this greenfield?
### What a Strong Answer Covers
- **Requirements framing**: separates functional (decision, rules + model, analyst tooling, feedback ingestion, auditability) from non-functional (latency, availability, explainability, safe rollout, security) and ties decisions back to the business tradeoff (fraud loss vs. approval rate).
- **Synchronous decision path**: a clean Risk API, an idempotent request/response contract with reason codes and rule/model versions, and a decision engine that fuses hard rules, the model score, and policy into allow / deny / challenge / review.
- **Feature architecture**: an online feature store fed by a stream processor; the same feature definitions used offline for training to avoid **training–serving skew**; a per-decision feature snapshot.
- **Rules engine**: versioning, approval workflow, dry-run / shadow, allow/block lists, and full audit of who changed what.
- **Model lifecycle**: registry with versioned models + feature schema, shadow → canary rollout, a rules-only fallback, and monitoring for drift and feature freshness.
- **Scale & availability**: handling QPS within the latency budget, multi-zone deployment, caching of static data, strict timeouts, circuit breakers, and an explicit **fail-open vs. fail-closed** policy by risk tier.
- **Auditability & monitoring**: what is persisted per decision, plus the key health metrics (latency/error rate, decision distribution, chargeback / confirmed-fraud rate, false-positive rate from review, score drift).
- **Security & privacy**: tokenization, encryption, role-based access, tamper-resistant logs, and regulatory alignment.
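To illustrate the synchronous decision path from the list above, here is a minimal sketch of a decision engine that fuses rule hits with a model score and returns reason codes plus rule and model versions. Everything named here (`Rule`, `Decision`, `DecisionEngine`, the score thresholds, and the severity ordering of actions) is an assumption for illustration; in particular, whether manual review outranks a challenge is a policy choice.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Assumed ordering used to pick the strictest action; real policy may differ.
SEVERITY = {"allow": 0, "challenge": 1, "review": 2, "deny": 3}


@dataclass(frozen=True)
class Rule:
    rule_id: str
    version: int
    action: str
    predicate: Callable[[dict], bool]


@dataclass(frozen=True)
class Decision:
    action: str
    reason_codes: Tuple[str, ...]
    rule_versions: Tuple[str, ...]
    model_version: str


class DecisionEngine:
    def __init__(self, rules: List[Rule], model_version: str,
                 challenge_at: float = 0.6, review_at: float = 0.8) -> None:
        self.rules = rules
        self.model_version = model_version
        self.challenge_at = challenge_at
        self.review_at = review_at
        self._by_key: Dict[str, Decision] = {}  # idempotency cache (in-memory sketch)

    def decide(self, idempotency_key: str, features: dict, model_score: float) -> Decision:
        cached = self._by_key.get(idempotency_key)
        if cached is not None:
            return cached  # a retried request gets the original decision
        action, reasons = "allow", []
        for rule in self.rules:
            if rule.predicate(features):
                reasons.append(f"RULE:{rule.rule_id}")
                if SEVERITY[rule.action] > SEVERITY[action]:
                    action = rule.action
        if model_score >= self.review_at:
            model_action = "review"
            reasons.append("MODEL:SCORE_ABOVE_REVIEW")
        elif model_score >= self.challenge_at:
            model_action = "challenge"
            reasons.append("MODEL:SCORE_ABOVE_CHALLENGE")
        else:
            model_action = "allow"
        if SEVERITY[model_action] > SEVERITY[action]:
            action = model_action
        decision = Decision(
            action=action,
            reason_codes=tuple(reasons),
            rule_versions=tuple(f"{r.rule_id}@{r.version}" for r in self.rules),
            model_version=self.model_version,
        )
        self._by_key[idempotency_key] = decision
        return decision
```

Returning the rule and model versions with every decision is what lets a later audit reproduce why a given payment was denied or reviewed.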
### Follow-up Questions
- A new fraud attack pattern emerges that the model has never seen. How does an analyst respond **within minutes**, and how do you ensure the rule they ship doesn't accidentally block a large swath of legitimate traffic?
- Chargeback labels arrive 30–60 days after the transaction. How does this label delay affect model retraining cadence and your ability to detect a sudden model-quality regression *quickly*? What faster proxy signals could you watch?
- How do you measure whether a deny/challenge decision was *correct* given that you never observe the counterfactual outcome of transactions you blocked? What sampling or experimentation could break this feedback bias?
- The model serving tier degrades and starts timing out under a traffic spike. Walk through exactly what your service returns for a $5 transaction vs. a $5,000 transaction during the outage, and why.
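For the label-delay follow-up, one candidate proxy signal is drift in the model's score distribution, which can be checked long before chargebacks arrive. The sketch below computes the Population Stability Index (PSI) between a baseline histogram of scores and a recent one. The bucket counts are made up, and the function name and `eps` floor are assumptions. PSI is one option among several, not something the question itself prescribes.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two histograms with the same buckets.

    PSI = sum over buckets of (a_i - e_i) * ln(a_i / e_i),
    where e_i and a_i are the baseline and recent bucket fractions.
    """
    if len(expected_counts) != len(actual_counts):
        raise ValueError("histograms must use the same buckets")
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Floor each fraction at eps so empty buckets do not produce log(0).
        e_frac = max(e / e_total, eps)
        a_frac = max(a / a_total, eps)
        total += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return total


baseline = [700, 200, 100]   # hypothetical score buckets: low / medium / high
recent = [500, 250, 250]
drift = psi(baseline, recent)
```

A PSI of zero means the two distributions match bucket for bucket. Larger values indicate a shift worth investigating, though any alert threshold is a tuning decision and a shift in scores can reflect a change in traffic rather than a model regression.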
Quick Answer: This system design question tests the ability to architect a real-time, low-latency fraud decisioning service that combines a deterministic rules engine with machine-learning model scores. It evaluates practical understanding of distributed-systems trade-offs, feature store design, and safe model and rule deployment on a high-availability critical path.