Scalable Distributed System Architecture

What's being tested

Interviewers are probing whether you can turn an ambiguous, large-scale product requirement into a scalable distributed architecture with clear APIs, data models, traffic estimates, failure handling, and operational tradeoffs. For TikTok-scale systems, a Software Engineer must reason about latency, availability, consistency, capacity, and cost under global traffic, bursty fanout, and partial failures. Strong answers show you can decompose a system into services, choose storage and messaging primitives intentionally, and explain how the design behaves at p99, not just in the happy path. The interviewer is also checking whether you can communicate architecture clearly: requirements first, then data flow, then bottlenecks, then mitigations.

Core knowledge

Requirements framing should separate functional requirements from non-functional requirements. For example: “send push/email/SMS notifications” is functional; p99 < 500ms enqueue latency, 99.99% availability, regional data residency, deduplication, and retry semantics are non-functional constraints that shape the architecture.
Back-of-the-envelope capacity planning grounds every design. Estimate peak requests per second with $\text{QPS} = \frac{\text{daily events}}{86{,}400} \times \text{peak factor}$ and storage with $\text{bytes/day} = \text{events/day} \times \text{avg event size} \times \text{replication factor}$. Use peak factors like 5x–20x for notifications, ticket launches, and viral traffic.
Service decomposition should follow ownership and scaling boundaries. Typical components include an API gateway, stateless application services, durable storage, cache, message queue, worker fleet, rate limiter, and observability pipeline. Avoid over-splitting early; introduce separate services when scaling, isolation, or deployment cadence requires it.
Stateless services behind load balancers scale horizontally and simplify failover. Keep request state in Redis, MySQL, Postgres, DynamoDB, Cassandra, or object storage rather than local memory. Statelessness enables autoscaling but shifts correctness to storage, cache invalidation, and idempotency.
Data modeling depends on access patterns. Postgres or MySQL works for relational constraints and transactions; DynamoDB or Cassandra fits high-write, key-value workloads; Elasticsearch fits search; Redis fits ephemeral counters, locks, and hot reads. Always state primary keys, secondary indexes, and the most common query paths.
Consistency tradeoffs are central. Strong consistency is needed for inventory, payments, and ticket ownership; eventual consistency is acceptable for feeds, notifications, counters, and analytics-like status updates. Name the failure mode: stale reads, duplicate sends, lost updates, split-brain leadership, or conflicting writes.
Idempotency prevents duplicate side effects during retries. Use an idempotency_key, request hash, or natural unique key such as (user_id, campaign_id, channel) and store a processed record with TTL. This is critical for notification sends, order creation, reservation confirmation, and external provider callbacks.
Message queues such as Kafka, Pulsar, RabbitMQ, or cloud queues decouple producers from consumers. They absorb spikes, support retries, and allow worker autoscaling. Discuss delivery semantics: at-most-once may lose messages; at-least-once may deliver duplicates; exactly-once is expensive and is often approximated with at-least-once delivery plus idempotent consumers.
Partitioning and sharding control scale and hotspots. Hash by user_id for even distribution, by event_id for immutable events, or by region for data residency. Beware celebrity users, flash sales, and global announcements; these create hot keys that require fanout batching, token buckets, or precomputed shards.
Caching improves latency but adds invalidation complexity. Use CDN for static assets, Redis for hot objects and counters, and local in-process caches for small reference data. Define TTLs, cache-aside versus write-through behavior, and how the system behaves on cache miss or cache outage.
Reliability patterns include timeouts, retries with exponential backoff and jitter, circuit breakers, bulkheads, dead-letter queues, and graceful degradation. Retries must be bounded; otherwise they amplify outages. A good default is to set short client timeouts, retry only idempotent operations, and send poison messages to a dead-letter queue (DLQ).
Observability must include metrics, logs, and traces. Track QPS, p50/p95/p99 latency, error rate, queue lag, retry count, consumer throughput, saturation, dropped messages, duplicate rate, and provider-specific failures. Tie alerts to user impact and SLOs, not just CPU thresholds.
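The retry guidance above (bounded attempts, exponential backoff with jitter, and a DLQ for messages that never succeed) can be sketched as follows. This is a minimal illustration, not a production client: TransientError, call_with_retries, and the list standing in for a DLQ are hypothetical names, and the default delays are assumed values.

```python
import random
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def call_with_retries(
    operation: Callable[[], T],
    dead_letter: List[str],
    max_attempts: int = 4,        # assumed bound; retries must be finite
    base_delay: float = 0.1,      # assumed first backoff cap, in seconds
    max_delay: float = 2.0,       # assumed upper bound on any single delay
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> Optional[T]:
    """Run an idempotent operation with bounded, jittered retries."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                # Exponential cap, then "full jitter": a random delay in [0, cap).
                cap = min(max_delay, base_delay * (2 ** attempt))
                sleep(rng() * cap)
    # Attempts exhausted: park the failure in the DLQ stand-in for later replay.
    dead_letter.append(repr(last_error))
    return None
```

Only TransientError is retried here; any other exception propagates immediately, which matches the advice to retry only operations known to be safe to repeat.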

Worked example

For the prompt "Design a global notification service," start by clarifying scope in the first 30 seconds: “Are we supporting push only, or also email and SMS? Is the requirement low-latency transactional notifications, bulk marketing campaigns, or both? Do we need regional data residency and user-level opt-out preferences?” Then declare assumptions: multi-tenant service, billions of notifications per day, at-least-once delivery internally, no duplicate visible sends when possible, and regional active-active deployment.
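To make the "billions of notifications per day" assumption concrete, the capacity formulas from the core knowledge section can be applied directly. The inputs below (2 billion events per day, a 10x peak factor, 1 KB per event, 3x replication) are illustrative assumptions, not figures from any real system.

```python
SECONDS_PER_DAY = 86_400


def peak_qps(daily_events: int, peak_factor: float) -> float:
    """QPS = daily events / 86,400 * peak factor."""
    return daily_events / SECONDS_PER_DAY * peak_factor


def bytes_per_day(daily_events: int, avg_event_bytes: int, replication_factor: int) -> int:
    """bytes/day = events/day * avg event size * replication factor."""
    return daily_events * avg_event_bytes * replication_factor


# Assumed inputs for illustration only.
DAILY_EVENTS = 2_000_000_000
PEAK_FACTOR = 10
AVG_EVENT_BYTES = 1_000
REPLICATION_FACTOR = 3

average = DAILY_EVENTS / SECONDS_PER_DAY
peak = peak_qps(DAILY_EVENTS, PEAK_FACTOR)
storage = bytes_per_day(DAILY_EVENTS, AVG_EVENT_BYTES, REPLICATION_FACTOR)

print(f"average: {average:,.0f} events/s")      # average: 23,148 events/s
print(f"peak:    {peak:,.0f} events/s")         # peak:    231,481 events/s
print(f"storage: {storage / 1e12:.1f} TB/day")  # storage: 6.0 TB/day
```

Under these assumptions the queue and worker fleet must be sized for roughly 230k events per second at peak, not the 23k average, which is the kind of statement worth saying out loud before drawing any components.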

Organize the answer around four pillars: API surface, data model, delivery pipeline, and reliability/operations. The API might include POST /notifications, POST /campaigns, GET /status/{id}, and preference-management endpoints. The data model should include notification templates, user channel tokens, preferences, tenant quotas, send attempts, and deduplication records keyed by idempotency_key. The delivery path should be: API validates and persists request, publishes to Kafka, channel-specific workers consume, apply rate limits and preferences, call external providers such as APNs or FCM, and record delivery status asynchronously.
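The deduplication records keyed by idempotency_key can be illustrated with a small consumer sketch. DedupStore and handle_message are hypothetical names, and the in-memory dictionary is only a stand-in; a real deployment might use something like a Redis SET with the NX and EX options or a unique-key insert in a database.

```python
import time
from typing import Callable, Dict


class DedupStore:
    """In-memory stand-in for a dedup table with per-key TTL (hypothetical)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    def claim(self, key: str) -> bool:
        """Return True only for the first caller within the TTL window."""
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False  # already processed recently: treat as duplicate
        self._expires_at[key] = now + self._ttl
        return True


def handle_message(message: dict, store: DedupStore, send: Callable[[dict], None]) -> str:
    """Worker step: skip redelivered messages, send everything else once."""
    if not store.claim(message["idempotency_key"]):
        return "skipped_duplicate"
    send(message)
    return "sent"
```

This sketch claims the key before sending, so a crash between the claim and the send would drop the notification. Claiming after the send flips the risk toward duplicates instead; which side to err on is a tradeoff worth naming in the interview.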

A key tradeoff to flag is fanout timing: fanout-on-write sends all user notifications immediately and gives low latency, but creates huge bursts; fanout-on-read or staged fanout smooths load but delays delivery and complicates status tracking. For global delivery, explain regional routing: keep user preference data in-region when required, use local queues and workers, and fail over only traffic that is legally and operationally safe to move. Close by saying: “If I had more time, I’d go deeper on provider-specific throttling, tenant isolation, disaster recovery drills, and exactly how SLOs map to alert thresholds.”
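One way to picture staged fanout is to split the recipient list into batches and release them only as a token bucket allows. The sketch below is an assumption-laden illustration: TokenBucket and staged_fanout are hypothetical names, and the rate and capacity would in practice come from provider limits and tenant quotas.

```python
import time
from typing import Callable, Iterator, List


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so an initial burst is allowed
        self._last = clock()

    def try_acquire(self, n: float = 1) -> bool:
        """Take n tokens if available; never blocks."""
        now = self._clock()
        # Refill based on elapsed time, capped at the bucket capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= n:
            self._tokens -= n
            return True
        return False


def staged_fanout(recipients: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield recipients in fixed-size batches instead of one giant burst."""
    for start in range(0, len(recipients), batch_size):
        yield recipients[start:start + batch_size]
```

A worker could call try_acquire() before publishing each batch and requeue the batch with a delay when it returns False, which trades delivery latency for a bounded send rate.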

A second angle

For the prompt "Design a high-volume ticketing system," the same architecture vocabulary applies, but the dominant constraint changes from asynchronous fanout to strong consistency under contention. A notification service can usually tolerate duplicate internal processing if the final send is deduped; a ticketing system cannot oversell inventory. The design should emphasize reservation records, short TTL holds, transactional decrement or compare-and-swap, queue-based admission control, and anti-hotspot strategies for popular events. Redis can help with waiting rooms or rate limiting, but the source of truth for ticket ownership needs a durable store with transactional guarantees, such as Postgres, MySQL, or a carefully designed distributed database. The strongest candidates explicitly contrast “smooth bursts with queues” versus “serialize scarce inventory updates.”
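The compare-and-swap idea can be sketched with a versioned inventory row. This is an in-memory illustration only: InventoryStore and reserve are hypothetical names, the lock merely simulates the atomicity a database would provide, and a real system would more likely express the same check as a conditional update on a version column inside the durable store.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Inventory:
    available: int
    version: int


class InventoryStore:
    """In-memory stand-in for a single inventory row with a version column."""

    def __init__(self, available: int):
        self._lock = threading.Lock()
        self._row = Inventory(available=available, version=0)

    def read(self) -> Inventory:
        with self._lock:
            return self._row

    def compare_and_swap(self, expected_version: int, new_available: int) -> bool:
        """Write only if nobody else has updated the row since it was read."""
        with self._lock:
            if self._row.version != expected_version:
                return False  # lost the race; caller must re-read and retry
            self._row = Inventory(new_available, expected_version + 1)
            return True


def reserve(store: InventoryStore, quantity: int, max_attempts: int = 5) -> bool:
    """Try to take `quantity` tickets without ever driving inventory below zero."""
    for _ in range(max_attempts):
        row = store.read()
        if row.available < quantity:
            return False  # sold out (or not enough left)
        if store.compare_and_swap(row.version, row.available - quantity):
            return True
    return False  # too much contention; caller can queue or back off
```

Because every successful write bumps the version, two buyers who read the same row cannot both decrement it, which is the "serialize scarce inventory updates" property. Under heavy contention most attempts fail and retry, which is why admission control in front of this path still matters.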

Common pitfalls

Pitfall: Jumping straight into boxes and arrows without requirements.

A tempting answer is “I’ll use Kafka, Redis, and microservices” before defining traffic, latency, consistency, or failure expectations. A stronger answer starts with scope, assumptions, core APIs, and capacity estimates, then chooses technologies because they satisfy those constraints.

Pitfall: Treating all data as equally consistent.

Many candidates say “eventual consistency is fine” for everything, which fails for ticket purchases, quota enforcement, and user preference compliance. Call out which operations need strong consistency, which can be eventually consistent, and what user-visible anomaly might occur if the system returns stale data.

Pitfall: Ignoring operational behavior after the initial design.

A design that works only when dependencies are healthy is incomplete. Discuss what happens when Redis is down, a Kafka partition lags, an external push provider throttles, or a region fails; include retries, dead-letter queues, backpressure, dashboards, and manual replay paths.
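For the case where an external provider throttles or fails, a circuit breaker is one way to stop hammering it while it recovers. The sketch below is deliberately simplified: CircuitBreaker and its thresholds are hypothetical, and once the reset timeout passes it allows any number of trial requests rather than exactly one, which a production breaker would usually restrict.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after consecutive failures; allows trial calls after a cool-down."""

    def __init__(self, failure_threshold: int, reset_timeout: float,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # Open: reject until the cool-down has elapsed, then permit trial calls.
        return self._clock() - self._opened_at >= self.reset_timeout

    def record_success(self) -> None:
        # A success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            # (Re)open and restart the cool-down from now.
            self._opened_at = self._clock()
```

A worker would check allow_request() before calling the provider, leave the message on the queue when it returns False, and report the outcome of each real call with record_success() or record_failure().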

Connections

Interviewers may pivot from distributed architecture into API design, database indexing and sharding, message queue semantics, rate limiting, or incident debugging. They may also ask you to deep-dive a project you built, so prepare one real architecture where you can explain the original constraints, the bottleneck you hit, the tradeoff you chose, and what you would redesign now.
