Scalable Service And Distributed System Design

What's being tested

These prompts test whether you can turn an ambiguous product capability into a scalable distributed service with clear APIs, data models, storage choices, consistency guarantees, and failure handling. Microsoft interviewers are probing for practical engineering judgment: when to use precomputation, caching, queues, sharding, replication, idempotency, and eventual consistency instead of overbuilding a perfect but unrealistic system. They also want to see whether you can reason from requirements to bottlenecks: read/write ratio, `p99` latency, durability, fanout, freshness, multi-tenant isolation, and operational debuggability. A strong answer is not “use microservices and `Kafka`”; it is a structured tradeoff discussion that makes the system reliable under load and understandable during incidents.

Core knowledge

Clarify scale and SLOs first: expected `QPS`, read/write ratio, data size, freshness, `p50/p95/p99` latency, availability target, and geographic scope. A design for 10K users can use `Postgres`; a design for 100M users needs partitioning, asynchronous processing, and degraded modes.
Use back-of-the-envelope math to expose bottlenecks. For example, storage/day $\approx events/sec \times avg\_bytes \times 86400$ . A logging service ingesting 1M events/sec at 1KB/event writes about 86TB/day before replication and indexes.
Separate write path and read path with CQRS-style thinking when workloads differ. Logging, typeahead, and org-chart counts often need fast ingestion or updates, plus separately optimized query indexes. The write model may be normalized; the read model may be denormalized or pre-aggregated.
Choose consistency deliberately. Strong consistency is appropriate for permissions, money, and authoritative org relationships; eventual consistency is usually acceptable for typeahead rankings, notification eligibility, crawler indexes, and log search freshness. State staleness budgets explicitly, e.g. “subordinate counts may lag by 5 seconds.”
Precompute expensive aggregates when reads dominate. For org-chart subordinate counts, options include materialized counts on each manager, closure tables, nested sets, or path-prefix encodings. Reads become `O(1)` or `O(log n)`, while moves and reorgs become more expensive and require transactional care.
Use queues for smoothing and isolation. `Kafka`, `Azure Event Hubs`, or `RabbitMQ` decouple producers from consumers, absorb bursts, and enable retries. They do not magically guarantee correctness; consumers still need idempotency keys, checkpointing, dead-letter queues, and replay-safe processing.
Partition by the access pattern. Typeahead may shard by prefix or term hash; logs often shard by tenant and time; crawlers shard by host/domain to enforce politeness; notifications shard by user ID or scheduled delivery time. Bad partition keys create hot shards and uneven `p99`.
Cache only after defining invalidation. Use `Redis`, CDN edge caches, or in-process caches for hot prefixes, org counts, notification templates, or crawler robots rules. Specify TTLs, write-through versus cache-aside, stampede protection, and what happens when cached data is stale.
Design for failure as a first-class requirement. Include retries with exponential backoff and jitter, circuit breakers, timeouts, bulkheads, and graceful degradation. For example, typeahead can fall back to popular global suggestions; logging can buffer locally; notifications can retry but must avoid duplicate sends.
Indexing drives query feasibility. Prefix search can use tries, finite-state transducers, `Elasticsearch` completion suggesters, or sorted arrays with range scans. Log search needs inverted indexes plus time partitioning. Org hierarchy queries need ancestor/descendant indexes, not recursive scans over millions of rows.
Concurrency control matters for mutable state. Org reparenting, notification scheduling, and crawler frontier updates can race. Use optimistic concurrency with version numbers, compare-and-swap, database transactions, or distributed locks sparingly; prefer single-writer partition ownership where possible.
Observability is part of the design. Define service metrics such as ingest lag, queue depth, dropped events, duplicate delivery rate, cache hit rate, index freshness, crawl politeness violations, and `p99` latency. A system that cannot be debugged in production is incomplete.

Worked example

For Design a typeahead search service, a strong candidate starts by asking: “Are suggestions for documents, people, or queries? What is the corpus size? Do we need personalization? What latency target should I optimize for, say `p99 < 100ms`? How fresh must new terms be?” Then they declare reasonable assumptions: hundreds of millions of indexed terms, read-heavy traffic, globally distributed users, and suggestions based on prefix, popularity, and optional user context.

The answer can be organized around four pillars: API design, indexing, ranking, and serving architecture. The API might expose `GET /suggest?q=mic&userId=...&limit=10`, returning suggestion text, type, score, and tracking metadata. For indexing, describe building a prefix index using a trie, sorted term dictionary, or `Elasticsearch` completion suggester, with each prefix mapping to top-K candidate IDs. For ranking, keep the SWE discussion practical: combine popularity, recency, locale, and lightweight personalization features, while leaving deep model architecture out of scope.

The serving architecture should include a low-latency suggestion service, `Redis` or memory caches for hot prefixes, replicated index shards, and an offline or streaming index builder that publishes immutable index segments. A key tradeoff to call out is freshness versus latency: rebuilding compact top-K prefix indexes every few minutes gives fast reads, while real-time updates require delta indexes or write-through updates that add complexity. Close by saying that, with more time, you would cover abuse protection, multilingual tokenization, A/B-safe ranking rollout, and observability metrics like cache hit rate, index age, empty-result rate, and `p99` latency.

A second angle

For Design read-heavy org chart subordinate counts, the same scalable-service principles apply, but the dominant constraint is not low-latency prefix lookup; it is maintaining correct aggregates under hierarchy mutations. A naive recursive traversal for every manager query fails when reads are frequent and subtrees are large, so you would precompute counts or maintain an ancestor-descendant index. Unlike typeahead, freshness and correctness are more sensitive because employees and permissions may depend on reporting lines. The key tradeoff becomes whether to pay update cost during reorgs, using closure-table updates or count propagation, to make reads cheap. You should explicitly discuss concurrency: two simultaneous manager changes must not double-count or orphan employees.

Common pitfalls

Pitfall: Jumping to tools before requirements.

Saying “I’ll use `Kafka`, `Redis`, and `Cosmos DB`” without estimating traffic, latency, or consistency needs sounds shallow. A better answer starts with workload shape, then maps each requirement to a component: queue for burst absorption, cache for hot reads, durable store for source of truth, index for query latency.

Pitfall: Ignoring mutation and backfill paths.

Many candidates design the happy-path read API but forget reorgs, deletes, retries, duplicate messages, schema evolution, or index rebuilds. For logging, that means no plan for late-arriving events or retention; for notifications, duplicate sends; for org charts, corrupted counts after a move. Always describe how data changes, how work is retried, and how the system recovers.

Pitfall: Over-optimizing one dimension while hiding tradeoffs.

A fully precomputed typeahead index is fast but may be stale; strong consistency for every log write may destroy throughput; crawling as fast as possible violates politeness and gets blocked. State the tradeoff explicitly and choose based on product constraints: “I accept 1-minute index freshness to keep `p99` under 50ms.”

Connections

Interviewers can pivot from here into database design, caching strategy, message queues and stream processing, distributed locking, search indexing, or observability and incident response. They may also ask you to zoom into one component, such as designing the crawler frontier, the notification scheduler, a log query index, or a migration plan from a single-node prototype to a sharded service.

What's being tested

Core knowledge

Worked example

A second angle

Common pitfalls

Connections

Further reading

Featured in interview prep guides

Practice questions

Related concepts