Distributed System Design For Ledgers And Counters

What's being tested

Stripe-style system design interviews for software engineers probe whether you can build systems where money-like state and high-volume event-derived state have different correctness requirements. A ledger demands strong consistency, idempotency, auditability, and clear failure semantics; an activity counter often accepts bounded staleness but must handle scale, hot keys, deduplication, and time-windowed aggregation. The interviewer is looking for a crisp separation of invariants: what must be linearizable, what can be eventually consistent, and how you recover after retries, crashes, and external dependency failures. Stripe cares because payments infrastructure frequently combines durable financial records with operational counters, limits, fraud signals, and integrations with third-party services.

Core knowledge

Ledger systems should be modeled as append-only immutable entries, not mutable balances. Store debits and credits as journal lines with a transaction id, account id, currency, amount, and timestamp; derive balance as $\sum \text{credits} - \sum \text{debits}$, optionally materialized for speed.
Double-entry accounting is the core invariant for financial movement: every transaction must balance, typically $\sum_i \text{amount}_i = 0$ in a single currency and ledger domain. Enforce this inside one database transaction or equivalent consensus-backed write path.
Linearizability matters for operations like “reserve funds,” “capture payment,” or “prevent overdraft.” If two concurrent writes can both observe the same prior balance, you need serializable isolation, conditional writes, or row locks, whether in a consensus-backed store such as Spanner or FoundationDB or in a carefully configured single-primary Postgres.
Idempotency keys turn client or worker retries into safe operations. Persist (idempotency_key, operation_type, request_hash, response/result) with a unique constraint; if the same key appears with a different payload, return a conflict rather than silently creating a second ledger movement.
Outbox pattern is safer than “write database, then publish event.” In the same transaction as the ledger write, insert an outbox row; a relay later publishes to Kafka, SQS, or Pub/Sub. Consumers deduplicate by event id, giving at-least-once delivery without losing ledger facts.
Distributed transactions should be avoided across internal ledger state and external services. If integrating with a routing, bank, or card-network API, use a state machine: PENDING → CONFIRMED or PENDING → FAILED, plus retries, reconciliation jobs, and compensating entries instead of pretending two-phase commit exists across the internet.
Activity counters are usually derived views over events, not source-of-truth records. A robust design ingests events, deduplicates by event id, partitions by entity and time bucket, and stores aggregates like (counter_id, bucket_start, granularity, count) in Redis, DynamoDB, Cassandra, or Bigtable.
Tumbling windows use fixed buckets such as one row per minute: bucket = $\lfloor \text{timestamp} / \text{window\_size} \rfloor \times \text{window\_size}$. Sliding windows are computed by summing recent smaller buckets, e.g. last 5 minutes = five 1-minute buckets, trading read amplification for simple writes.
Hot-key sharding is necessary when one merchant, location, or activity becomes extremely popular. Write to shards like (counter_id, bucket, shard_id) where shard_id = hash(event_id) % S; reads sum across S shards. Increase S when per-key QPS exceeds a partition’s safe write rate.
Deduplication for counters is bounded by the time horizon. Use a durable event-id table for strict correctness, or a Redis SET/Bloom filter with TTL for near-real-time counters. Bloom filters reduce memory but introduce false positives: a new event is occasionally treated as already seen and dropped, so the counter undercounts by roughly the configured false-positive rate.
Consistency tradeoffs should be explicit. A dashboard counter can tolerate seconds of lag and eventual convergence; an authorization limit or ledger balance usually cannot. Name the SLA: for example, “counter visible within 5 seconds, exact after backfill,” versus “ledger write committed once and visible immediately.”
Observability and reconciliation are part of the design, not afterthoughts. Track ingestion lag, dedup hit rate, dropped events, outbox backlog, ledger imbalance count, and reconciliation mismatches. For ledgers, a daily job should recompute balances from entries and alert on any non-zero imbalance.
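The ledger points above (append-only entries, the balance invariant, derived balances, and an outbox row written in the same transaction) can be sketched as follows. This is a minimal illustration using an in-memory SQLite database, not a production design; the table layout, the function names `open_ledger`, `post_transaction`, and `balance`, and the convention of storing credits as positive and debits as negative integers in minor units are all assumptions made for the example.

```python
import sqlite3
import uuid


def open_ledger() -> sqlite3.Connection:
    """Create an in-memory ledger with a hypothetical schema (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ledger_entries (
            entry_id   INTEGER PRIMARY KEY,
            txn_id     TEXT NOT NULL,
            account_id TEXT NOT NULL,
            currency   TEXT NOT NULL,
            amount     INTEGER NOT NULL  -- minor units; credit > 0, debit < 0
        );
        CREATE TABLE outbox (
            event_id  TEXT PRIMARY KEY,
            txn_id    TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )
    return conn


def post_transaction(conn, txn_id, currency, lines):
    """Append balanced journal lines plus one outbox row, atomically.

    lines is a list of (account_id, signed_amount) pairs.
    """
    # Double-entry invariant: signed amounts must sum to zero.
    if sum(amount for _, amount in lines) != 0:
        raise ValueError("unbalanced transaction")
    with conn:  # commits on success, rolls back if an exception escapes
        conn.executemany(
            "INSERT INTO ledger_entries (txn_id, account_id, currency, amount) "
            "VALUES (?, ?, ?, ?)",
            [(txn_id, account, currency, amount) for account, amount in lines],
        )
        # Outbox row shares the transaction, so a relay can publish it later.
        conn.execute(
            "INSERT INTO outbox (event_id, txn_id) VALUES (?, ?)",
            (str(uuid.uuid4()), txn_id),
        )


def balance(conn, account_id, currency):
    """Derive the balance from entries instead of storing a mutable number."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM ledger_entries "
        "WHERE account_id = ? AND currency = ?",
        (account_id, currency),
    ).fetchone()
    return row[0]
```

Posting `[("customer", -500), ("merchant", 500)]` leaves the merchant with a derived balance of 500, while a set of lines that does not sum to zero is rejected before anything is written. The daily reconciliation job mentioned above would amount to re-running sums like the one in `balance` and alerting on any transaction whose lines do not net to zero.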

Worked example

For the question “Design ledger and bikemap integration,” a strong candidate should first clarify the invariant: “Is the ledger the source of truth for billable activity, and is the external bikemap service authoritative only for route distance or pricing inputs?” Then declare assumptions: ledger writes must be strongly consistent and auditable; bikemap calls may fail, time out, or return changed routes; users may retry requests. Organize the answer around four pillars: data model, write path, external integration workflow, and reconciliation/observability.

The data model should separate immutable ledger entries from trip or route metadata: a trips table can hold route state, while a ledger_entries table records balanced debits and credits with an idempotency key. The write path should avoid a distributed transaction between the ledger database and bikemap; instead, create a PENDING_ROUTE or PENDING_CHARGE state, call the external service via an async worker, then commit a balanced ledger transaction when the required data is available. A specific tradeoff to flag is synchronous versus asynchronous integration: synchronous gives lower latency for the user if bikemap is healthy, but asynchronous state transitions are more reliable under timeouts and retries. If the external service returns different results on retry, store the request parameters and versioned response used for billing so the ledger remains explainable. Close by saying that, with more time, you would add reconciliation jobs, webhook replay handling if the service is event-driven, and security controls such as signed requests, scoped credentials, and audit logs.
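The asynchronous workflow described above might look roughly like the sketch below. It is an assumption-laden illustration: the `Charge` record, the `process` worker function, the retry limit, and the `call_external` callable standing in for the bikemap client are all hypothetical, and a real worker would persist each transition and then post the balanced ledger transaction once the charge is confirmed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class ChargeState(Enum):
    PENDING_CHARGE = "PENDING_CHARGE"
    CONFIRMED = "CONFIRMED"
    FAILED = "FAILED"


# Only the pending state may move forward; terminal states never change.
ALLOWED = {ChargeState.PENDING_CHARGE: {ChargeState.CONFIRMED, ChargeState.FAILED}}


@dataclass
class Charge:
    charge_id: str
    request_params: dict                     # stored so billing stays explainable
    state: ChargeState = ChargeState.PENDING_CHARGE
    attempts: int = 0
    billing_response: Optional[dict] = None  # the exact response used for billing


def transition(charge: Charge, new_state: ChargeState) -> None:
    if new_state not in ALLOWED.get(charge.state, set()):
        raise ValueError(f"illegal transition {charge.state} -> {new_state}")
    charge.state = new_state


def process(charge: Charge, call_external: Callable[[dict], dict],
            max_attempts: int = 3) -> Charge:
    """Worker step: call the external service with retries, then settle."""
    if charge.state is not ChargeState.PENDING_CHARGE:
        return charge  # already settled; a redelivered job is a no-op
    while charge.attempts < max_attempts:
        charge.attempts += 1
        try:
            response = call_external(charge.request_params)
        except TimeoutError:
            continue  # retry; the external call is assumed safe to repeat
        charge.billing_response = response
        transition(charge, ChargeState.CONFIRMED)
        return charge
    transition(charge, ChargeState.FAILED)
    return charge
```

Because the stored `billing_response` is written once at confirmation, a later retry that would have returned a different route does not change what the ledger was billed against.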

A second angle

For the question “Design a local activity counter service,” the same discipline applies, but the correctness bar moves from financial invariants to bounded error and freshness. Instead of designing a serializable double-entry write path, frame the system as event ingestion, deduplication, bucketed aggregation, hot-key mitigation, and query serving. The interviewer may push on tumbling versus sliding windows, so explain that one-minute buckets can serve both hourly tumbling counts and “last 15 minutes” sliding counts by summing recent buckets. Idempotency still matters, but duplicate events affect counts rather than money movement, so you can choose between exact dedup storage and TTL-based approximate dedup depending on SLA. The key transfer is knowing which state is authoritative and which state is a derived, eventually consistent view.
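As a rough sketch of that pipeline, the example below keeps everything in memory: an exact dedup set, one-minute buckets, shard-spread writes, and a sliding read that sums recent buckets. The class name `ActivityCounter`, the constants `BUCKET_SECONDS` and `NUM_SHARDS`, and the use of CRC32 as the shard hash are assumptions for illustration; a real service would back the dictionary and set with stores such as Redis or DynamoDB and expire dedup entries with a TTL.

```python
import zlib
from collections import defaultdict

BUCKET_SECONDS = 60  # assumed bucket granularity: one minute
NUM_SHARDS = 4       # assumed shard count S for hot keys


class ActivityCounter:
    def __init__(self):
        # (counter_id, bucket_start, shard_id) -> count
        self._counts = defaultdict(int)
        # Exact dedup set; unbounded here, TTL-bounded in a real system.
        self._seen = set()

    def record(self, counter_id: str, event_id: str, timestamp: int) -> bool:
        """Count one event; return False if it was a duplicate."""
        if event_id in self._seen:
            return False
        self._seen.add(event_id)
        # Tumbling bucket: floor(timestamp / window_size) * window_size.
        bucket_start = (timestamp // BUCKET_SECONDS) * BUCKET_SECONDS
        # Spread writes for one counter across shards by hashing the event id.
        shard_id = zlib.crc32(event_id.encode()) % NUM_SHARDS
        self._counts[(counter_id, bucket_start, shard_id)] += 1
        return True

    def bucket_count(self, counter_id: str, bucket_start: int) -> int:
        """Read one tumbling bucket by summing across all shards."""
        return sum(
            self._counts.get((counter_id, bucket_start, shard), 0)
            for shard in range(NUM_SHARDS)
        )

    def sliding_count(self, counter_id: str, now: int, window_buckets: int) -> int:
        """Approximate sliding window: sum the most recent buckets."""
        current = (now // BUCKET_SECONDS) * BUCKET_SECONDS
        return sum(
            self.bucket_count(counter_id, current - i * BUCKET_SECONDS)
            for i in range(window_buckets)
        )
```

The sliding read is accurate only to bucket granularity, since the oldest bucket in the sum is included whole; smaller buckets tighten that error at the cost of more rows per read.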

Common pitfalls

Pitfall: Treating counters and ledgers as the same consistency problem.

A tempting answer is “put all events in Kafka and aggregate later,” but that is not sufficient for financial correctness. For a ledger, the committed database record is the source of truth; streams and counters are downstream projections, not the authority for whether money moved.

Pitfall: Hand-waving idempotency as “we’ll retry safely.”

Retries are only safe if the system can recognize the same logical operation after a client timeout, worker crash, or duplicate message. A stronger answer names the unique key, where it is stored, how payload mismatches are handled, and whether the prior response is replayed to the caller.
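One way to make those answers concrete is the sketch below, which keys stored results by (idempotency key, operation type), hashes the request payload, raises a conflict on mismatch, and replays the prior response on a true retry. The names `IdempotencyStore`, `IdempotencyConflict`, and `execute` are hypothetical, and the in-memory dictionary stands in for a table with a unique constraint; in a real ledger the key row and the ledger write would commit in the same database transaction.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable


class IdempotencyConflict(Exception):
    """Same key reused with a different payload."""


@dataclass
class StoredResult:
    request_hash: str
    response: dict


class IdempotencyStore:
    def __init__(self):
        # (idempotency_key, operation_type) -> StoredResult
        self._rows = {}

    @staticmethod
    def _hash(payload: dict) -> str:
        # Canonical JSON so that key order does not change the hash.
        encoded = json.dumps(payload, sort_keys=True).encode()
        return hashlib.sha256(encoded).hexdigest()

    def execute(self, key: str, operation_type: str, payload: dict,
                handler: Callable[[dict], dict]) -> dict:
        request_hash = self._hash(payload)
        existing = self._rows.get((key, operation_type))
        if existing is not None:
            if existing.request_hash != request_hash:
                raise IdempotencyConflict(f"key {key!r} reused with new payload")
            return existing.response  # replay; the handler does not run again
        response = handler(payload)
        self._rows[(key, operation_type)] = StoredResult(request_hash, response)
        return response
```

A retry with the same key and payload returns the stored response without invoking the handler a second time, which is what keeps a client timeout from producing a second ledger movement.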

Pitfall: Over-indexing on technology before invariants.

Jumping straight to Kafka, Redis, and Cassandra can sound scalable but shallow. Start with invariants and access patterns: exact balance checks, append-only audit trails, write QPS, read freshness, window sizes, and hot keys; then choose storage and queues that support those constraints.

Connections

Interviewers often pivot from this area into event sourcing, CQRS, rate limiting, distributed locking, stream processing, and database isolation levels. Be ready to compare Postgres serializable transactions with consensus-backed stores, and to explain when approximate structures like Count-Min Sketch or Bloom filters are acceptable.
