System Design: Scheduled Payments Service
Background
Design a backend service that lets end-users schedule one-time or recurring payments. The service must reliably execute payments at the intended time regardless of the user's time zone and daylight-saving changes, and must be resilient to transient failures and third-party payment processor outages.
Assume you are building the service for a consumer-facing financial app that already has user accounts and tokenized payment methods (e.g., cards, bank accounts, wallets) available via a payment processor. You are responsible for the scheduling, orchestration, and reliable execution layer.
Requirements
Functional
- Create, update, cancel scheduled payments:
  - One-time (execute exactly once at a future time).
  - Recurring (e.g., daily/weekly/monthly rules; end date optional).
- Execute payments at the correct local time for the user across time zones and DST transitions.
- At-least-once semantics for internal job execution; aim for exactly-once charging at the processor using idempotency.
- Handle retries with backoff for transient errors, and classify permanent failures.
- Handle third-party processor outages and partial failures (e.g., network timeouts, declines, requires-user-action).
- Provide APIs for CRUD on schedules and to fetch payment history.
- Durable job orchestration (e.g., queues, timers, scheduler) with audit logs and user notifications (email/push/webhooks) for key events.
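The local-time requirement above is the one most easily gotten wrong, so a small illustration may help. The following is a minimal sketch, assuming schedules are stored as a wall-clock time plus an IANA zone name and resolved to a UTC instant for each occurrence. The function names (`resolve_local_time`, `daily_runs_utc`) and the policy for ambiguous and nonexistent times are assumptions made for illustration; they are not part of the requirements.

```python
from datetime import date, datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo  # standard library, Python 3.9+


def resolve_local_time(local_naive: datetime, tz_name: str) -> datetime:
    """Map a user's wall-clock time to the UTC instant at which to execute.

    Assumed policy (an illustration, not a requirement):
      - Ambiguous time (clocks fall back): use the first occurrence (fold=0).
      - Nonexistent time (clocks spring forward): fold=0 applies the
        pre-transition offset, so the run lands just after the gap.
    """
    local = local_naive.replace(tzinfo=ZoneInfo(tz_name), fold=0)
    return local.astimezone(timezone.utc)


def daily_runs_utc(start: date, at: time, tz_name: str, days: int) -> list[datetime]:
    """Expand a daily rule by stepping the local calendar date, not UTC."""
    return [
        resolve_local_time(datetime.combine(start + timedelta(days=i), at), tz_name)
        for i in range(days)
    ]


if __name__ == "__main__":
    # 09:00 New York time across the March 2024 DST change.
    for run in daily_runs_utc(date(2024, 3, 9), time(9, 0), "America/New_York", 3):
        print(run.isoformat())
```

Stepping the local date and converting each occurrence separately keeps the payment at 09:00 on the user's clock; the UTC instant moves from 14:00 to 13:00 once daylight saving time begins. Adding 24 hours to the previous UTC instant would instead drift by an hour after every transition.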
Non-Functional
- Scale to tens of millions of active schedules and thousands of payments per second at peak.
- High availability and durability. No missed executions; bounded delay.
- Observability: metrics, logs, traces, and alerting.
- Security/compliance appropriate for financial data (PII/PCI, key management, access control).
Deliverables
Provide:
- High-level architecture and components.
- Public APIs (endpoints, request/response, idempotency strategy).
- Storage schema (tables/indices) and data flow, including audit logs.
- Durable job orchestration design (scheduler, queues/timers, retry strategy).
- Exactly-once/at-least-once semantics and idempotency approach.
- Handling of time zones/DST and recurrence rules.
- Notifications and webhooks design.
- Scaling/partitioning, monitoring/alerting, and security/compliance considerations.
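As one possible shape for the retry strategy deliverable, the sketch below combines failure classification with capped exponential backoff and full jitter. The result codes, the `FailureClass` names, the default limits, and the choice to treat unknown codes as transient are all hypothetical; a real processor defines its own codes, and some declines (for example, insufficient funds) may reasonably be retried later.

```python
import random
from enum import Enum
from typing import Callable, Optional, Tuple


class FailureClass(Enum):
    TRANSIENT = "transient"            # retry with backoff
    PERMANENT = "permanent"            # stop and record the failure
    NEEDS_USER_ACTION = "user_action"  # pause and notify the user


# Hypothetical processor result codes, for illustration only.
_CLASSIFICATION = {
    "network_timeout": FailureClass.TRANSIENT,
    "processor_unavailable": FailureClass.TRANSIENT,
    "rate_limited": FailureClass.TRANSIENT,
    "card_declined": FailureClass.PERMANENT,
    "invalid_payment_method": FailureClass.PERMANENT,
    "requires_user_action": FailureClass.NEEDS_USER_ACTION,
}


def classify(code: str) -> FailureClass:
    """Unknown codes default to transient here; that default is an assumption."""
    return _CLASSIFICATION.get(code, FailureClass.TRANSIENT)


def backoff_delay(
    attempt: int,
    base: float = 2.0,
    cap: float = 3600.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt)."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def next_action(
    code: str, attempt: int, max_attempts: int = 8
) -> Tuple[str, Optional[float]]:
    """Decide what to do after the zero-based attempt `attempt` failed with `code`."""
    failure = classify(code)
    if failure is FailureClass.NEEDS_USER_ACTION:
        return ("notify_user", None)
    if failure is FailureClass.PERMANENT or attempt + 1 >= max_attempts:
        return ("fail", None)
    return ("retry", backoff_delay(attempt))
```

Jitter spreads retries out so that many payments failing during the same processor outage do not all retry at the same instant when it recovers. The attempt cap bounds how long an occurrence can stay pending before it is reported as failed.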
Constraints and Assumptions
- You may assume a relational primary store (e.g., Postgres) and a durable queue/stream (e.g., SQS/Kafka) are available.
- Payment processor supports idempotency keys and retrieving payment status.
- Use minimal, reasonable assumptions if any details are missing.
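Because the processor is assumed to support idempotency keys, at-least-once job execution can still yield at most one charge per occurrence, provided every retry of that occurrence sends the same key. The sketch below shows one way to derive such a key. The `idempotency_key` function, the key format, and the in-memory `FakeProcessor` are hypothetical stand-ins for illustration and do not represent any real processor's API.

```python
import hashlib
from datetime import datetime, timezone


def idempotency_key(schedule_id: str, scheduled_for: datetime) -> str:
    """Derive a stable key for one occurrence of one schedule.

    The key depends only on the schedule and the intended execution instant,
    never on the attempt number, so every retry reuses it.
    """
    if scheduled_for.tzinfo is None:
        raise ValueError("scheduled_for must be timezone-aware")
    instant = scheduled_for.astimezone(timezone.utc).isoformat(timespec="seconds")
    return hashlib.sha256(f"{schedule_id}|{instant}".encode()).hexdigest()


class FakeProcessor:
    """In-memory stand-in for a processor that honors idempotency keys."""

    def __init__(self) -> None:
        self._by_key: dict[str, dict] = {}

    def charge(self, key: str, amount_cents: int) -> dict:
        # A repeated key returns the original result instead of charging again.
        if key not in self._by_key:
            self._by_key[key] = {
                "payment_id": f"pay_{len(self._by_key) + 1}",
                "amount_cents": amount_cents,
            }
        return self._by_key[key]

    @property
    def charge_count(self) -> int:
        return len(self._by_key)


if __name__ == "__main__":
    key = idempotency_key("sched_123", datetime(2024, 3, 10, 13, 0, tzinfo=timezone.utc))
    processor = FakeProcessor()
    processor.charge(key, 2500)  # first delivery of the job
    processor.charge(key, 2500)  # redelivery after a timeout
    print(processor.charge_count)  # prints 1
```

If a schedule can be edited between attempts, a design might also fold a schedule version into the key so that a changed amount is not silently deduplicated against the earlier attempt; whether that is desirable depends on the product rules.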