PracHub

Design a traditional fraud detection system

Last updated: Jun 24, 2026

Quick Overview

This question evaluates a Machine Learning Engineer's competency in end-to-end ML system design for real-time payments fraud detection, including labeling under delayed confirmations, handling extreme class imbalance and sampling, feature engineering across behavioral, graph, device and merchant signals, model selection for latency and scale, and production scoring and monitoring architecture. It is commonly asked in the ML System Design category to assess how an engineer balances low-latency decision-making with delayed sparse labels, calibration and threshold trade-offs, operational scalability and resiliency, and drift/adversarial detection, testing both conceptual understanding and practical application.


Company: PayPal

Role: Machine Learning Engineer

Category: ML System Design

Difficulty: hard

Interview Round: Onsite

Design an end-to-end fraud detection system. Specify positive/negative labeling strategy given delayed and scarce fraud confirmations, sampling to address extreme class imbalance, feature sets (behavioral, graph, device, merchant), model choices and justification, real-time scoring architecture and latency constraints, thresholding and precision/recall trade-offs, evaluation metrics (PR-AUC, precision@k, cost-sensitive metrics), and monitoring for drift and adversarial adaptation.

Sep 6, 2025, 12:00 AM

Design an End-to-End Real-Time Payments Fraud Detection System

You are a Machine Learning Engineer at a large online payments platform. Design a traditional ML fraud detection system that issues an approve / review / decline decision synchronously at authorization time, then learns from confirmed-fraud labels (chargebacks, disputes, investigations) that arrive weeks later and are scarce.

This is a "classic" ML system design problem — deep neural sequence models and graph learning may appear as components, but the spine is a calibrated, low-latency tabular pipeline. Work through the design end to end, justifying each choice against the latency, label-delay, and class-imbalance constraints.

Constraints & Assumptions

  • Decision point: in-line with the payment authorization flow. The score gates whether the charge is attempted, so model inference must fit in roughly single-digit to low tens of milliseconds, inside an end-to-end P99 on the order of tens of milliseconds.
  • Label delay: the true label of a transaction may not be known for 30–90 days (the chargeback/dispute cycle), and some fraud is never confirmed.
  • Class imbalance: transaction-level fraud prevalence is typically well under 1%.
  • Non-stationarity: the environment is adversarial; fraudsters adapt continuously, so signal decays.
  • Train/serve consistency: features computed offline for training must equal features computed online at decision time.
  • Treat exact SLAs, prevalence, and cost figures as values you would negotiate with the business; state your assumptions rather than inventing precise numbers.

Clarifying Questions to Ask

  • What is the cost asymmetry — what does an approved fraudulent transaction cost (loss net of recovery) versus a wrongly declined good transaction (lost margin + customer-experience damage), and what does a manual review cost?
  • Is there a human review queue, and what is its capacity and SLA? (This determines whether "review" is even an available action.)
  • What is the actual latency SLO at authorization, and how is the budget split across the auth flow?
  • What label sources exist (network chargeback reason codes, first-party fraud reports, manual investigation verdicts) and how reliable/timely is each?
  • What is the regulatory / adverse-action context (e.g. need for reason codes, restrictions on protected attributes)?
  • Do we have an exploration budget to approve a small fraction of would-be-declines for unbiased labels?

Part 1 — Labeling under delayed, scarce confirmations

Define how you turn raw transactions into training labels when confirmed fraud arrives weeks late. Cover how you assign positive vs. negative labels, the role of observation/maturity windows, how you handle disputed or undetermined outcomes, and how you avoid target leakage.

What This Part Should Cover

  • A precise positive definition (which confirmed signals, within what window from the transaction time t0) and a "fully matured, no signal" negative definition.
  • Explicit treatment of immature/disputed transactions (not silently labeled negative).
  • The label-freshness vs. label-maturity tension, and a concrete way to react faster (weak/early-proxy positives) without corrupting the clean label set.
  • Concrete anti-leakage discipline: time-based splits and point-in-time feature construction.
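
The labeling rules above can be sketched as follows. This is a minimal illustration, not a reference implementation: the `Txn` fields, the 90-day `MATURITY_WINDOW`, and the choice to exclude late confirmations are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumed maturity window; the real value comes from the chargeback/dispute cycle.
MATURITY_WINDOW = timedelta(days=90)

@dataclass
class Txn:
    txn_id: str
    t0: datetime                                   # transaction (authorization) time
    fraud_confirmed_at: Optional[datetime] = None  # chargeback or investigation verdict
    dispute_open: bool = False                     # outcome still undetermined

def assign_label(txn: Txn, as_of: datetime) -> Optional[int]:
    """Return 1 (fraud), 0 (matured, no signal), or None (exclude from training)."""
    confirmed = txn.fraud_confirmed_at
    # Only use confirmations already known at `as_of` (no peeking into the future).
    if confirmed is not None and confirmed <= as_of:
        # Confirmations later than the window are excluded here, not called clean.
        return 1 if confirmed - txn.t0 <= MATURITY_WINDOW else None
    if txn.dispute_open or as_of - txn.t0 < MATURITY_WINDOW:
        return None  # disputed or immature: never silently negative
    return 0

def time_split(txns, cutoff: datetime):
    """Train strictly before `cutoff`, validate at or after it (no random shuffling)."""
    train = [t for t in txns if t.t0 < cutoff]
    valid = [t for t in txns if t.t0 >= cutoff]
    return train, valid
```

Calling `assign_label` with an explicit `as_of` date makes the label itself point-in-time: rebuilding a training set "as of" an earlier date reproduces exactly what was knowable then.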

Part 2 — Sampling under extreme class imbalance

Describe your offline training strategy for a sub-1% positive rate and what (if anything) you do differently at serving time. Be explicit about how you keep probabilities meaningful.

What This Part Should Cover

  • A tractable scheme (downsample matured negatives, keep all positives and hard negatives, preserve the time distribution).
  • An explicit mechanism to recover true-prior, calibrated probabilities (logit prior-correction and/or post-hoc calibration), and the distinction from loss reweighting.
  • Selection-bias awareness (only approved transactions ever receive labels) and a mitigation.
  • That sampling is training-only — serving scores every event on the true distribution.
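
As a worked example of the prior-correction step: if all positives are kept and negatives are kept with probability `keep_rate`, the sampled odds are inflated by `1 / keep_rate`, so multiplying the odds by `keep_rate` recovers the true-prior probability. The function names below are illustrative, and the sketch assumes uniform negative downsampling (not loss reweighting, which changes the objective rather than the data).

```python
import math

def correct_for_negative_downsampling(p_sampled: float, keep_rate: float) -> float:
    """Map a probability learned on downsampled data back to the true prior.

    Assumes every positive was kept and each negative was kept
    independently with probability `keep_rate`.
    """
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    return keep_rate * p_sampled / (keep_rate * p_sampled + (1.0 - p_sampled))

def correct_logit(logit_sampled: float, keep_rate: float) -> float:
    """Same correction in logit space: a constant shift of log(keep_rate)."""
    return logit_sampled + math.log(keep_rate)

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))
```

For instance, a score of 0.5 from a model trained with 1% of negatives corresponds to a true-prior probability of about 0.0099. The correction is applied at serving time to every score; a post-hoc calibrator fitted on an unsampled, matured holdout can then clean up any residual miscalibration.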

Part 3 — Feature sets

Enumerate the feature families you would build and how they are served. Cover behavioral/velocity, graph/link, device/network, and merchant/context signals.

What This Part Should Cover

  • Behavioral/velocity: multi-window counts/sums keyed by multiple entities, instrument reuse, burstiness, sequence/session signals.
  • Graph/link: identity graph across accounts/cards/devices/IPs and derived risk (neighbor fraud, component size, diffusion), with a note that traversal is too slow in-line.
  • Device/network: fingerprint stability, emulator/proxy flags, IP/ASN reputation, geovelocity.
  • Merchant/context: MCC/merchant chargeback rate, amount-vs-history, 3DS/AVS/CVV outcomes; plus serving/hygiene (online vs. offline store consistency, PII handling, freshness SLAs).
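
A velocity feature of the kind listed above can be sketched with a per-entity sliding window. This is a single-process illustration under stated assumptions: the `VelocityCounter` class is hypothetical, events are assumed to arrive in time order, and a production system would hold this state in a streaming job and an online store instead.

```python
from collections import defaultdict, deque

class VelocityCounter:
    """Sliding-window count and sum per entity key (card, device, IP, ...)."""

    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self.events = defaultdict(deque)  # key -> deque of (timestamp, amount)

    def _evict(self, key, now):
        q = self.events[key]
        # Drop events at or before the start of the window.
        while q and q[0][0] <= now - self.window:
            q.popleft()

    def features(self, key, now):
        """Count and amount sum of events already recorded inside the window."""
        self._evict(key, now)
        q = self.events[key]
        return len(q), sum((amount for _, amount in q), 0.0)

    def add(self, key, ts, amount):
        """Record an event; call this after scoring it, never before."""
        self.events[key].append((ts, amount))
```

Reading `features` before calling `add` for the current transaction keeps the feature point-in-time: the event being scored never contributes to its own velocity, which is the same rule the offline backfill has to follow to avoid train/serve skew.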

Part 4 — Model choices and justification

Recommend a baseline and the advanced components, justified against latency and scale. Address how graphs, sequences, and semi-/weak supervision fit in without breaking the latency budget.

What This Part Should Cover

  • A justified primary model (GBDT) and why it fits latency/scale/explainability.
  • How advanced signals enter (graph/sequence embeddings as nearline features; entity embeddings for high cardinality) without in-line traversal.
  • A serving strategy that bounds latency (cascade / two-stage) and a rule layer for hard blocks and fallback.
  • Use of weak/semi-supervision and active learning to grow labels.
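
One possible shape for the cascade plus rule layer is sketched below. The scorers are passed in as plain callables, so nothing here is tied to a specific library; the names `stage1`, `stage2`, `rule_score` and the `low`/`high` cutoffs are assumptions for illustration.

```python
from typing import Callable, Mapping, Tuple

Features = Mapping[str, float]
Scorer = Callable[[Features], float]

def cascade_score(
    features: Features,
    hard_block: Callable[[Features], bool],
    stage1: Scorer,      # cheap model on always-available features
    stage2: Scorer,      # richer model (e.g. GBDT with embedding features)
    rule_score: Scorer,  # conservative rule-based fallback
    low: float = 0.01,   # assumed "clearly safe" cutoff
    high: float = 0.90,  # assumed "clearly risky" cutoff
) -> Tuple[float, str]:
    """Return (score, source) so logging records which path decided."""
    if hard_block(features):
        return 1.0, "hard_block"
    try:
        p = stage1(features)
        # Confident stage-1 scores skip the expensive model entirely.
        if p < low or p > high:
            return p, "stage1"
        return stage2(features), "stage2"
    except Exception:
        # Broad catch is deliberate in this sketch: any model failure
        # (timeout, missing feature) degrades to the rule layer.
        return rule_score(features), "rule_fallback"
```

Because stage-1 and stage-2 scores feed the same thresholds, both need to be calibrated to the same true-prior scale; otherwise the cascade changes the effective policy depending on which stage answered.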

Part 5 — Real-time scoring architecture and latency

Lay out the end-to-end serving architecture and the latency budget. Cover event ingestion, the online/offline feature store split, streaming aggregations, model serving, and fallbacks; give an illustrative P99 budget and how you keep the system resilient.

What This Part Should Cover

  • The data path: event bus with idempotency, streaming aggregations, online + offline feature stores kept skew-free, stateless model serving.
  • An illustrative P99 allocation, framed as a planning target to be negotiated against the SLO rather than a strict additive sum (per-stage P99s do not add up to the end-to-end P99).
  • Fallbacks and resiliency: degraded-feature handling, model-down fallback to rules, fail-open/closed by risk segment, shadow/blue-green rollout.
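
To see why a P99 budget is not a simple sum of per-stage P99s, the toy simulation below draws synthetic latencies for three sequential stages and compares the end-to-end P99 with the sum of the stage P99s. Everything here is an assumption: the stage means are made up, latencies are modeled as independent exponentials, and real stages are often correlated (a shared dependency stalling pushes the tails closer to additive).

```python
import random

def p99(samples):
    """Empirical 99th percentile by order statistic."""
    ordered = sorted(samples)
    return ordered[int(0.99 * (len(ordered) - 1))]

def simulate_p99(stage_means_ms, n=20000, seed=7):
    """Return (per-stage P99s, end-to-end P99) for sequential stages.

    Latencies are synthetic: independent exponential draws per stage.
    """
    rng = random.Random(seed)
    per_stage = [
        [rng.expovariate(1.0 / mean) for _ in range(n)]
        for mean in stage_means_ms
    ]
    end_to_end = [sum(parts) for parts in zip(*per_stage)]
    return [p99(stage) for stage in per_stage], p99(end_to_end)

# Hypothetical stages: feature fetch, model inference, rule evaluation.
stage_p99s, e2e_p99 = simulate_p99([2.0, 3.0, 1.0])
```

Under these independence assumptions the end-to-end P99 lands well below the sum of the stage P99s, because all stages rarely hit their tails on the same request. That is the reason to treat per-stage numbers as planning targets and validate the end-to-end SLO by measurement.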

Part 6 — Thresholding and precision/recall trade-offs

Turn calibrated scores into the approve / review / decline policy. Show how thresholds come from the cost structure (not guesses), and how a review queue and segment-specific costs change the cutoffs.

What This Part Should Cover

  • A cost-matrix-derived two-action threshold and why amount-aware thresholds beat a single global cutoff.
  • A three-action policy with a review band, set by expected-profit/queue-capacity reasoning, and queue prioritization by expected loss (probability × amount).
  • Per-segment thresholds and surfacing the PR curve so operators pick an operating point.
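
The cost-matrix reasoning can be made concrete with a small sketch. Declining is worth it when the expected fraud loss of approving, p × loss, exceeds the expected cost of wrongly declining, (1 − p) × false-positive cost, which gives the threshold p* = c_fp / (c_fp + c_fn). All cost parameters below (`margin_rate`, `friction_cost`, `loss_rate`, `review_cost`) are placeholder assumptions, and the review action is assumed to resolve the transaction correctly at a flat cost.

```python
def decline_threshold(amount, margin_rate=0.03, friction_cost=5.0, loss_rate=1.0):
    """Two-action cutoff p* = c_fp / (c_fp + c_fn) for a given amount."""
    c_fp = margin_rate * amount + friction_cost  # wrongly declined good customer
    c_fn = loss_rate * amount                    # approved fraud, net of recovery
    return c_fp / (c_fp + c_fn)

def choose_action(p, amount, margin_rate=0.03, friction_cost=5.0,
                  loss_rate=1.0, review_cost=3.0):
    """Pick the action with the lowest expected cost."""
    costs = {
        "approve": p * loss_rate * amount,
        "decline": (1.0 - p) * (margin_rate * amount + friction_cost),
        "review": review_cost,  # assumes a perfectly accurate reviewer
    }
    return min(costs, key=costs.get)

def prioritize_reviews(candidates, capacity):
    """Rank (txn_id, p_fraud, amount) by expected loss and keep the top `capacity`."""
    ranked = sorted(candidates, key=lambda c: c[1] * c[2], reverse=True)
    return [txn_id for txn_id, _, _ in ranked[:capacity]]
```

With these placeholder costs the cutoff falls from about 0.35 on a 10-unit payment to about 0.03 on a 1,000-unit payment, because the fixed friction cost matters less as the amount grows. That is the sense in which an amount-aware threshold beats a single global cutoff.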

Part 7 — Evaluation metrics

Specify the offline and online metrics, why they suit extreme imbalance, and — critically — how you evaluate honestly given delayed labels and the fact that historical declines have no observed outcome.

What This Part Should Cover

  • Imbalance-aware ranking and operating-point metrics (PR-AUC, precision/recall@k) plus amount-weighted cost/profit on the business cost matrix.
  • Evaluating only on matured label windows, and never computing metrics on a window too recent for its labels to have matured.
  • Off-policy / counterfactual evaluation (IPW / doubly-robust, anchored by an exploration holdout) and decision-aware backtesting under queue constraints.
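
The metrics above are simple enough to sketch in plain Python. The average precision here is the step-wise form commonly used as a PR-AUC estimate and ignores score ties; the IPW function is a self-normalized estimate that assumes each labeled transaction carries a known approval propensity, which in practice only a randomized exploration holdout can provide. Function names are illustrative.

```python
def _rank(labels, scores):
    """Labels ordered by descending score."""
    pairs = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return [label for _, label in pairs]

def average_precision(labels, scores):
    """Mean of precision@rank taken at each positive (ties not handled)."""
    hits, total = 0, 0.0
    for rank, label in enumerate(_rank(labels, scores), start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def precision_at_k(labels, scores, k):
    """Fraction of true fraud among the k highest-scored transactions (k >= 1)."""
    top = _rank(labels, scores)[:k]
    return sum(top) / len(top)

def ipw_fraud_rate(observed):
    """Self-normalized IPW fraud rate from (label, approval_propensity) pairs.

    Only approved, matured transactions have labels; each is up-weighted
    by 1 / propensity to stand in for similar traffic that was declined.
    """
    weights = [1.0 / propensity for _, propensity in observed]
    weighted_fraud = sum(w * label for w, (label, _) in zip(weights, observed))
    return weighted_fraud / sum(weights)
```

In the IPW example of one fraud approved with propensity 0.5 and two clean transactions approved with propensity 1.0, the naive fraud rate is 1/3 while the weighted estimate is 0.5, since the rarely approved fraud case represents two transactions' worth of traffic.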

Part 8 — Monitoring for drift and adversarial adaptation

Describe what you monitor before labels mature, how you detect concept drift and adversarial behavior, and your retraining/rollout guardrails.

What This Part Should Cover

  • Pre-label health signals (feature availability/freshness, input and score drift, early label proxies).
  • Concept-drift response: retraining cadence on rolling windows plus fast hotfix retrains, nearline refresh of velocity/graph features.
  • Adversarial monitoring (surge/graph-anomaly detectors) and reversible responses (friction/step-up auth) rather than only hard declines.
  • Rollout guardrails: canary/shadow comparison on amount-weighted loss, kill switches, documented rule-layer fallback.
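
For the input and score drift signals, one widely used pre-label check is the Population Stability Index (PSI) between a baseline window and the current window. The sketch below is a minimal version: the bin edges, the `eps` floor for empty bins, and any alerting threshold are assumptions to tune per feature.

```python
import math
from bisect import bisect_right

def bin_proportions(values, edges, eps=1e-6):
    """Share of values per bin; `edges` are sorted interior cut points."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    # Floor at eps so empty bins do not produce log(0) or division by zero.
    return [max(c / len(values), eps) for c in counts]

def psi(baseline, current, edges):
    """Population Stability Index: sum((cur - base) * ln(cur / base)) over bins."""
    base = bin_proportions(baseline, edges)
    cur = bin_proportions(current, edges)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, cur))
```

PSI is zero for identical distributions and grows as mass moves between bins. Running it on the model score as well as on individual features gives an alert that does not wait for 30 to 90 days of label maturity, though it detects distribution shift only, not whether the shift is fraud.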

What a Strong Answer Covers

Across all parts, a strong answer treats fraud detection as a cost-minimization decision system, not an accuracy benchmark, and keeps three cross-cutting threads coherent end to end:

  • The decision objective is dollars, not AUC — every threshold, metric, and sampling choice traces back to a cost matrix (fraud loss vs. declined-good-customer cost vs. review cost), and high-value transactions are held to a stricter bar.
  • Calibration is the connective tissue — labeling, sampling/prior-correction, model objective, thresholding, and evaluation are linked by the requirement that scores be true-prior-calibrated probabilities; the candidate should show why each stage preserves that.
  • Train/serve consistency and honest evaluation under delayed, biased labels — point-in-time features, time-based splits, matured-only evaluation, and off-policy correction recur across labeling, architecture, and metrics; a candidate who only optimizes offline AUC misses the problem.
  • Operational realism and adversarial robustness — latency budgets, fallbacks, a rule layer, reversible friction, and monitoring that front-runs 90-day labels distinguish a deployable design from a notebook model.

Follow-up Questions

  • How would you formulate setting a per-account spending limit as a reinforcement-learning problem (state, action, reward) — and why is RL a poor fit for the in-line approve/decline decision in Part 6?
  • Could a large language model replace or augment this traditional pipeline? Where would it plausibly help (e.g. unstructured signals, investigator assistance, narrative features) and where do latency, calibration, cost, and adversarial constraints make it a bad fit for the in-line scorer?
  • A new fraud ring appears that your matured training data has never seen, and it won't be confirmed for 60+ days. What in the design lets you respond now, and what are the false-positive risks of acting on early proxies?
  • Suppose calibration drifts in production but ranking (PR-AUC) looks stable. Which downstream decisions break, and how would you detect and remediate this without a full retrain?
