Design: Online Donation Platform for 3‑Day Campaigns
Context
You are designing an online donation platform optimized for short, 3‑day fundraising campaigns that can experience large traffic spikes at launch and close. The platform must process payments reliably, show near real-time totals, and comply with privacy and payments regulations.
Functional Requirements
- Donor
  - Donor signup/login (email, phone, social login optional)
  - Store minimal donor profile and preferences
- Campaigns
  - Admins create, configure, and schedule 3‑day campaigns (start/end times, goals, currency, geos)
  - Public campaign landing pages with progress bars and leaderboards
- Donations
  - Initiate and confirm one-time donations; support multiple currencies
  - Real-time campaign totals (amount and count) visible to donors
  - Generate receipts and send email/SMS confirmations
  - Refunds (full/partial) and chargeback handling
- Operations
  - Admin dashboards for monitoring, exporting, and reconciling
  - Webhooks/exports for finance/BI systems
Non-Functional Goals
- Availability SLOs
  - End-to-end donation processing: 99.95% monthly
  - Read paths (campaign pages, totals): 99.99% monthly
- Latency budgets (from API gateway)
  - P50 ≤ 200 ms, P95 ≤ 500 ms, P99 ≤ 1 s for non-payment endpoints
  - Payment confirmation may take up to 3–5 s (dependent on PSP); provide async UX
- Throughput targets (example sizing; justify assumptions)
  - Peak QPS for donation create: 3–8k QPS during spikes
  - Reads (landing pages, totals): 10–20× write QPS at peaks
  - Webhook handling: 1–2k QPS bursts
- Durability and integrity
  - No lost accepted donations; effectively exactly-once ledger semantics via idempotency (deduplicated retries) and reconciliation
Traffic Estimates (Example)
- Assume 5 campaigns/week; each 3 days; 500k–2M visits/campaign; 50k–300k donations
- Spiky arrivals: 40% of donations in first 6 hours, 40% in last 6 hours
- Peak write QPS example: 250k donations over 6 hours ≈ 11.6/s average; 10–100× bursts imply roughly 116–1,160/s, so the 3–8k/s planning target sits above the worst modeled burst and leaves headroom for overlapping campaigns and retries
- Real-time totals reads: 10–20 reads per write (progress bars, leaderboards)
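The sizing arithmetic above can be checked with a short script. This is a back-of-envelope sketch only: the inputs are the example figures from this section, and the function and variable names are illustrative.

```python
# Back-of-envelope sizing using the example figures from this section.
WINDOW_SECONDS = 6 * 3600  # one 6-hour launch or close window


def average_rate(donations: int, window_seconds: int = WINDOW_SECONDS) -> float:
    """Average donations per second over the window."""
    return donations / window_seconds


avg = average_rate(250_000)                  # about 11.6 donations/s
burst_low, burst_high = avg * 10, avg * 100  # 10x and 100x burst multipliers

# Read load at the planned write peak, at 10-20 reads per write.
reads_low, reads_high = 3_000 * 10, 8_000 * 20

print(f"average: {avg:.1f}/s, modeled bursts: {burst_low:.0f}-{burst_high:.0f}/s")
print(f"reads at planned write peak: {reads_low:,}-{reads_high:,}/s")
```

The read figures suggest that the totals and landing-page paths, not the write path, dominate request volume, which is consistent with serving them from cache.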
APIs and Data Models
Design REST (or gRPC) with idempotency keys and async confirmations.
- Core entities: Campaign, Donor, Donation, PaymentIntent, Receipt
- Key endpoints (high-level)
  - POST /campaigns
  - GET /campaigns/{id}
  - GET /campaigns/{id}/totals
  - POST /donors
  - POST /payment_intents (idempotent)
  - POST /donations/confirm (with PSP payment_method or token)
  - POST /donations/{id}/refunds
  - GET /donations/{id}, GET /receipts/{id}
  - POST /webhooks/payments (PSP → us)
- Data models (minimal fields)
  - Campaign
    - id (uuid), name, start_at, end_at, currency, goal_amount, status, settings (metadata), created_by
    - totals: amount_agg, count_agg (eventually consistent), version
  - Donor
    - id (uuid), email, phone (optional), name (optional), country, marketing_opt_in, created_at
  - PaymentIntent
    - id (uuid), campaign_id, donor_id (optional), client_secret/token (from PSP), amount, currency, status (requires_payment_method|requires_confirmation|processing|succeeded|canceled), idempotency_key, expires_at
  - Donation
    - id (uuid), campaign_id, donor_id (optional), payment_intent_id, amount, currency, status (pending|authorized|captured|refunded|failed|chargeback), external_payment_id, risk_score, created_at
  - Receipt
    - id (uuid), donation_id, receipt_number, issued_at, tax_fields, pdf_url/hash
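As an illustration of the Donation model, here is a minimal sketch using Python dataclasses. The field names follow the list above. The `amount_minor` integer representation (minor currency units) and the `ALLOWED` transition table are assumptions: the document names the states but does not define which transitions are legal.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class DonationStatus(str, Enum):
    PENDING = "pending"
    AUTHORIZED = "authorized"
    CAPTURED = "captured"
    REFUNDED = "refunded"
    FAILED = "failed"
    CHARGEBACK = "chargeback"


S = DonationStatus
# Hypothetical transition table; adjust to the PSP's actual lifecycle.
ALLOWED = {
    S.PENDING: {S.AUTHORIZED, S.CAPTURED, S.FAILED},
    S.AUTHORIZED: {S.CAPTURED, S.FAILED},
    S.CAPTURED: {S.REFUNDED, S.CHARGEBACK},
    S.REFUNDED: set(),
    S.FAILED: set(),
    S.CHARGEBACK: set(),
}


@dataclass
class Donation:
    campaign_id: str
    payment_intent_id: str
    amount_minor: int  # assumption: integer minor units (e.g. cents) avoid float rounding
    currency: str
    donor_id: Optional[str] = None
    external_payment_id: Optional[str] = None
    risk_score: Optional[float] = None
    status: DonationStatus = DonationStatus.PENDING
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def transition(self, new_status: DonationStatus) -> None:
        """Move to new_status, rejecting transitions not listed in ALLOWED."""
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"illegal transition {self.status.value} -> {new_status.value}")
        self.status = new_status
```

Rejecting illegal transitions at the model layer keeps late or duplicated webhooks from moving a refunded donation back to captured.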
Payment Processing
- Idempotency
  - All create/confirm endpoints require Idempotency-Key; store request hash + response; dedupe retries
- Flow
  - Client requests PaymentIntent (server calls PSP to create intent/client_secret)
  - Client collects payment method via PSP SDK (keeps PCI scope low)
  - Server confirm: POST /donations/confirm with payment_intent_id + token → call PSP confirm
- Statuses
  - Synchronous success → mark Donation captured, emit events, generate receipt
  - Async (3DS/SCA) → mark processing; rely on PSP webhooks to finalize
  - Failure → surface message; allow retry with same intent when possible
- Retries
  - Network timeouts: safe to retry with the same Idempotency-Key
  - Assume PSP confirm is idempotent when retried with the same key (verify per PSP); reconcile using external_payment_id
- Reconciliation
  - Ingest PSP webhooks (payment_intent.succeeded/failed/refunded/chargeback)
  - Nightly job diff: our ledger vs PSP payouts; resolve mismatches; escalate if needed
- Refunds
  - POST /donations/{id}/refunds with amount; call PSP refund; update status via webhook; generate refund receipt
- Receipts
  - On capture/refund, create immutable receipt record; send email; allow re-download; include tax fields per locale
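The "store request hash + response; dedupe retries" rule under Idempotency can be sketched as follows. This is an in-memory illustration, not a production design: `IdempotencyStore`, `IdempotencyConflict`, and `execute` are hypothetical names, and a real deployment would need an atomic insert (for example a unique constraint or a Redis SET NX) so that concurrent retries cannot both run the handler.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Tuple


class IdempotencyConflict(Exception):
    """The same Idempotency-Key was reused with a different request body."""


class IdempotencyStore:
    def __init__(self) -> None:
        # key -> (request hash, stored response)
        self._entries: Dict[str, Tuple[str, Any]] = {}

    @staticmethod
    def _hash(body: dict) -> str:
        # Canonical JSON so key order does not change the hash.
        canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def execute(self, key: str, body: dict, handler: Callable[[dict], Any]) -> Any:
        request_hash = self._hash(body)
        if key in self._entries:
            stored_hash, stored_response = self._entries[key]
            if stored_hash != request_hash:
                raise IdempotencyConflict(key)
            return stored_response  # replay: the handler is not called again
        response = handler(body)
        self._entries[key] = (request_hash, response)
        return response
```

A retry after a network timeout then returns the stored response, while reusing a key with a different amount is rejected rather than silently charged.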
Fraud and Abuse Mitigation
- Pre-authorization checks: velocity limits (per IP/device/card/email), disposable email detection, proxy/VPN detection
- Risk scoring: rules + ML (features: IP geolocation, BIN, AVS/CVV result, device fingerprint, donation velocity, amount anomalies)
- Strong customer authentication: enable 3DS/SCA where required/beneficial
- Block/allow lists: BIN ranges, IPs, emails, devices
- Chargeback handling: store evidence packet; export to PSP
- Bot mitigation: CAPTCHA after thresholds; WAF rules for spikes
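A velocity limit of the kind listed under pre-authorization checks could look like the sliding-window sketch below. The class name, the key format, and the thresholds are illustrative assumptions; timestamps are passed in explicitly to keep the example deterministic.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class VelocityLimiter:
    """Allow at most max_events per key within a sliding window."""

    def __init__(self, max_events: int, window_seconds: float) -> None:
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str, now: float) -> bool:
        events = self._events[key]
        # Drop timestamps that have aged out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_events:
            return False
        events.append(now)
        return True


# Example: at most 3 donation attempts per card fingerprint per minute.
limiter = VelocityLimiter(max_events=3, window_seconds=60)
decisions = [limiter.allow("card:abc", now=t) for t in (0, 1, 2, 3)]
```

The same limiter can be instantiated per dimension (IP, device, card, email), with the key prefix distinguishing them.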
Rate Limiting
- Token bucket per IP and per donor account (e.g., burst capacity of 20 requests, refilled at 5 req/s sustained)
- Endpoint-specific tighter limits for payment endpoints (e.g., 2 confirms/s per donor, 10/min per card)
- Global circuit breaker to protect PSP and ledger; return 429 with Retry-After when a limit is hit (503 with Retry-After is the more conventional status when the breaker itself is open)
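A minimal token bucket sketch for the limits above, assuming the example values are read as a capacity of 20 requests refilled at 5 tokens per second. The clock is passed in as `now` so the behaviour is easy to test; the class and method names are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    capacity: float     # maximum burst size, in requests
    refill_rate: float  # tokens added per second (the sustained rate)
    tokens: float = field(init=False)
    updated_at: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        self.tokens = self.capacity  # start with a full bucket

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then try to spend `cost` tokens."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

    def retry_after(self, cost: float = 1.0) -> float:
        """Seconds until `cost` tokens are available; usable as a Retry-After value."""
        return max(0.0, cost - self.tokens) / self.refill_rate


bucket = TokenBucket(capacity=20, refill_rate=5)
burst = [bucket.allow(now=0.0) for _ in range(21)]  # 20 allowed, the 21st rejected
```

In a multi-node gateway the bucket state would live in a shared store such as Redis rather than in process memory.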
Privacy and Compliance
- PCI
  - Use PSP hosted fields/SDK; never handle raw PAN/CVV; store only tokens/last4/brand/exp; SAQ-A compliance target
- PII
  - Data minimization; encrypt at rest (using FIPS 140-2 validated cryptographic modules) and in transit (TLS 1.2+); field-level encryption for email/phone
  - Access controls (RBAC/ABAC), audit logs, DLP monitoring
- CCPA/GDPR
  - Consent tracking, privacy policy links, purpose limitation
  - Data subject rights: access, deletion, rectification, portability; deletion workflows with event propagation
  - Data retention: configurable; pseudonymize for analytics; regional data residency where applicable
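One common way to pseudonymize identifiers for analytics is a keyed hash. The sketch below assumes HMAC-SHA256 with a secret held outside the analytics environment; the document does not prescribe a method, and `pseudonymize` and the key handling shown are illustrative only.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a stable, keyed pseudonym for an identifier such as an email.

    Normalizing first means 'A@Example.org ' and 'a@example.org' map to the
    same pseudonym, so analytics joins still work.
    """
    normalized = value.strip().lower()
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Assumption: in practice the key comes from a secrets manager, not source code.
example_key = b"replace-with-managed-secret"
token = pseudonymize("Donor@Example.org", example_key)
```

Because the mapping depends on the key, destroying or rotating the key makes old pseudonyms unlinkable, which can support deletion workflows.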
Architecture
- Edge
  - CDN + WAF + Bot protection; API Gateway (rate limiting, auth, idempotency middleware)
- Services (separate deployable units)
  - Campaign Service (CRUD, totals read)
  - Donor Service (profiles, consent)
  - Donation Service (Donation lifecycle, receipts)
  - Payment Service (PSP integration, idempotency, webhook handler)
  - Risk Service (risk scoring, rules engine)
  - Notification Service (email/SMS, receipt generation)
  - Reporting/Reconciliation Service (exports, nightly diff)
- Data
  - Primary OLTP DB (Postgres/MySQL) for strong consistency of transactional data; read replicas
  - Redis/Memcached for session, idempotency, rate limits, real-time counters
  - Event bus/stream (Kafka/PubSub) for outbox events, receipts, analytics
  - Object storage for receipts (PDF), exports; signed URLs
- Patterns
  - Outbox/transactional messaging for at-least-once event publication; idempotent consumers make processing effectively exactly-once
  - Sagas for cross-service workflows (donation confirmed → receipt → email)
  - Webhook endpoint isolated and heavily rate-limited; retry-friendly, idempotent
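The outbox pattern can be sketched with SQLite standing in for the OLTP database. The table layout, the function names, and the `publish` callback are assumptions for illustration. The point is that the donation row and its event row commit in the same transaction, and a separate relay publishes unpublished rows.

```python
import json
import sqlite3
from typing import Callable


def init_db(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE donations (
            id TEXT PRIMARY KEY, campaign_id TEXT, amount_minor INTEGER, status TEXT);
        CREATE TABLE outbox (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT, payload TEXT, published INTEGER DEFAULT 0);
        """
    )


def capture_donation(conn: sqlite3.Connection, donation_id: str,
                     campaign_id: str, amount_minor: int) -> None:
    payload = json.dumps({"donation_id": donation_id, "campaign_id": campaign_id,
                          "amount_minor": amount_minor})
    with conn:  # one transaction: both rows commit, or neither does
        conn.execute("INSERT INTO donations VALUES (?, ?, ?, 'captured')",
                     (donation_id, campaign_id, amount_minor))
        conn.execute("INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
                     ("donation.captured", payload))


def relay_once(conn: sqlite3.Connection, publish: Callable[[str, dict], None]) -> int:
    """Publish pending outbox rows in order; returns how many were sent."""
    rows = conn.execute(
        "SELECT seq, event_type, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_type, payload in rows:
        # A crash after publish but before the UPDATE causes a re-send on the
        # next run, so delivery is at-least-once and consumers must dedupe.
        publish(event_type, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

In production the relay would typically be a poller or a change-data-capture connector feeding the event bus.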
Scaling and Sharding Strategy
- Shard by campaign_id for write-heavy tables (donations, intents) when crossing single-node limits; keep donor global
- Read scaling via replicas and Redis caching (campaign pages, totals)
- PSP integration: use connection pools, circuit breakers, batch operations where available
- Autoscale stateless services on CPU/QPS; pre-warm at campaign launch windows
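Routing by campaign_id can be as simple as a stable hash, sketched below. The function name and the modulo scheme are assumptions. A stable digest is used instead of Python's built-in `hash()`, which is randomized per process for strings.

```python
import hashlib


def shard_for_campaign(campaign_id: str, num_shards: int) -> int:
    """Map a campaign to a shard index in [0, num_shards)."""
    digest = hashlib.sha256(campaign_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


shard = shard_for_campaign("7c9e6679-7425-40de-944b-e07fc1f90ae7", num_shards=16)
```

Two caveats apply to this scheme. All writes for one very large campaign still land on a single shard, so a hot campaign may need a sub-key. Plain modulo also remaps most campaigns when `num_shards` changes, which is why a directory table or consistent hashing is often preferred for resharding.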
Consistency Model for Counters (Real-Time Totals)
- Requirements: UI needs near real-time totals; correctness for finance must be exact
- Approach
  - Strong source of truth: ledger in OLTP (exact totals, eventual propagation)
  - Real-time counter: Redis sharded counters per campaign (INCRBY on capture), published via pub/sub to websockets
  - Periodic true-up: every N seconds/minutes recompute from authoritative events; correct drift
  - Tolerate small staleness (≤ 1–2 s) on UI; show "Last updated" badge
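The interaction between the fast counter and the periodic true-up can be illustrated in memory. Here plain Python integers stand in for Redis keys updated with INCRBY; `ShardedCounter` and `true_up` are hypothetical names, and the ledger total is assumed to come from the OLTP source of truth.

```python
import random
from typing import List


class ShardedCounter:
    """Per-campaign total split across shards to spread write contention."""

    def __init__(self, num_shards: int = 8) -> None:
        self._shards: List[int] = [0] * num_shards

    def incr_by(self, amount_minor: int) -> None:
        # Any shard will do; only the sum across shards is meaningful.
        self._shards[random.randrange(len(self._shards))] += amount_minor

    def total(self) -> int:
        return sum(self._shards)

    def true_up(self, ledger_total: int) -> int:
        """Reset the counter to the ledger's value; returns the drift corrected."""
        drift = self.total() - ledger_total
        self._shards = [ledger_total] + [0] * (len(self._shards) - 1)
        return drift


counter = ShardedCounter(num_shards=4)
for amount in (1000, 2500, 500):
    counter.incr_by(amount)
# Suppose the ledger also holds a 500 capture whose increment was lost.
corrected = counter.true_up(ledger_total=4500)
```

The drift value returned by the true-up is a natural metric to alert on, since persistent drift points to lost or duplicated increments.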
Handling Launch/Closing Traffic Spikes
- Pre-provision capacity: autoscale min replicas; warm caches; CDN priming
- Queue writes under extreme burst: accept intent, enqueue confirm worker; inform user if async
- Graceful degradation: slow mode on launch/close (lower image/video weight, cache totals more aggressively)
- Backpressure: tighten rate limits, exponential backoff; shed non-essential traffic (analytics)
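For the exponential backoff mentioned under backpressure, a jittered variant is usually preferable so that retrying clients do not synchronize. The sketch below uses the "full jitter" scheme; the base delay and cap are illustrative defaults, not values from this design.

```python
import random


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter backoff: a uniform delay in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)


# Delays a client might wait before retries 0 through 4.
delays = [backoff_delay(attempt) for attempt in range(5)]
```

When the server returns Retry-After, clients should wait at least that long and apply the jittered delay on top.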
Observability
- Metrics (RED/USE)
  - Request rate, errors, latency per endpoint; PSP call success/latency; webhook lag; queue depth; risk score distribution
  - Business: donations/min, authorization rate, capture rate, refund rate, chargeback rate, conversion funnel
- Logs
  - Structured JSON with correlation IDs; PII scrubbing; sampling for high-volume routes
- Tracing
  - Distributed tracing across gateway → services → PSP; annotate spans with idempotency keys and external_payment_id
- Alerting & SLOs
  - Error budget tracking; alerts on SLO burn, PSP degradation, counter drift, reconciliation mismatches
  - Synthetic tests during campaigns; canary deploys
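To make error budget tracking concrete, the availability SLOs stated earlier translate into allowed downtime as follows. The 30-day month and the function names are assumptions for the sake of the arithmetic.

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of full unavailability the SLO permits over the period."""
    return (1.0 - slo) * days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    return observed_error_ratio / (1.0 - slo)


donation_budget = error_budget_minutes(0.9995)  # about 21.6 minutes per 30 days
read_budget = error_budget_minutes(0.9999)      # about 4.3 minutes per 30 days
launch_burn = burn_rate(observed_error_ratio=0.001, slo=0.9995)  # about 2x budget
```

Because a campaign lasts only 3 days, one bad launch hour can consume a large share of the monthly budget, which argues for short-window burn-rate alerts during launch and close.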
Disaster Recovery (DR)
- Multi-AZ active by default; multi-region strategy
  - Active/active for stateless; active/passive for DB with async replication
- RPO/RTO targets
  - RPO ≤ 5 minutes (WAL/binlog streaming replication); RTO ≤ 30 minutes failover
- Regular backups, point-in-time recovery; DR drills and runbooks
- Idempotent reprocessing on failover using outbox and event replay
Backfills and Reprocessing
- Store raw PSP events and our emitted events in append-only topics
- Idempotent consumers keyed by donation_id/external_payment_id
- Backfill jobs for receipts/totals recomputation; versioned schemas and migration playbooks
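A totals backfill over replayed events might look like the sketch below. The event shape (`type`, `external_payment_id`, `campaign_id`, `amount_minor`) is hypothetical, and only capture events are counted. Refunds are left out because partial refunds would need their own dedupe key, which the document does not define.

```python
from typing import Dict, Iterable, Set, Tuple


def recompute_totals(events: Iterable[dict]) -> Dict[str, Tuple[int, int]]:
    """Return campaign_id -> (amount_minor, count), ignoring replayed duplicates."""
    seen: Set[str] = set()
    totals: Dict[str, Tuple[int, int]] = {}
    for event in events:
        if event["type"] != "captured":
            continue
        payment_id = event["external_payment_id"]
        if payment_id in seen:
            continue  # duplicate delivery or replay: already counted
        seen.add(payment_id)
        amount, count = totals.get(event["campaign_id"], (0, 0))
        totals[event["campaign_id"]] = (amount + event["amount_minor"], count + 1)
    return totals


replayed = [
    {"type": "captured", "external_payment_id": "p1", "campaign_id": "c1", "amount_minor": 1000},
    {"type": "captured", "external_payment_id": "p1", "campaign_id": "c1", "amount_minor": 1000},
    {"type": "captured", "external_payment_id": "p2", "campaign_id": "c1", "amount_minor": 2500},
    {"type": "failed", "external_payment_id": "p3", "campaign_id": "c1", "amount_minor": 700},
]
backfilled = recompute_totals(replayed)
```

Running the job twice over the same topic yields the same totals, which is the property that makes replay after a failover safe.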
Extensibility
- Recurring Donations
  - Store PaymentMethod tokens; scheduler service to create PaymentIntents on cadence; email pre-notifications; retry ladders; proration rules for campaign windows
- Corporate Matching
  - Employer registry and matching policies; collect employer info; hold match pledge vs funded state; reporting and disbursement workflows; partner APIs for verification