Caching, CDNs, and Edge Delivery at Google Scale
Asked of: Software Engineer
What's being tested
Interviewers are probing whether you can reason about latency, availability, consistency, and cost in a geographically distributed serving system. A strong Software Engineer answer shows you understand how caches behave under real load: hot keys, invalidation races, stale reads, thundering herds, and regional failures. Google cares because many systems depend on serving data near users while protecting origin services from global traffic spikes. The goal is not to recite “use a cache,” but to design the cache hierarchy, expiration policy, correctness model, and operational safeguards.
Core knowledge
- Cache-aside is the default backend pattern: application checks cache, falls back to the source of truth, then populates the cache. It is simple and resilient, but concurrent misses can overload the database unless you add request coalescing or per-key locks.
- Read-through and write-through caches centralize cache logic behind a library or service. They reduce duplicated application code, but introduce another dependency in the critical path and require careful timeout handling so cache failure does not become total service failure.
- TTL-based expiration trades freshness for simplicity. If an object has TTL $T$, worst-case staleness is roughly $T$ unless you also support explicit invalidation. Short TTLs improve correctness but reduce hit ratio and increase origin load.
- Hit ratio should be interpreted with traffic weight: object hit ratio counts keys, while byte hit ratio and request hit ratio better reflect load savings. A few hot objects can dominate, so optimize for weighted traffic, not just number of cached entries.
- CDNs use a hierarchy: browser cache, edge point of presence, regional cache, and origin. Each layer reduces latency and origin traffic, but each layer also creates another place where stale or incorrectly cached content can persist.
- Cache invalidation is harder than expiration because updates race with reads. Common approaches include versioned URLs, content hashes, soft purge, hard purge, and metadata-based revalidation using ETag or Last-Modified headers.
- Stale-while-revalidate improves tail latency by serving an expired object briefly while one worker refreshes it in the background. It is useful for static or semi-static content, but dangerous for strongly consistent data like account balances or access-control decisions.
- Eviction policies decide what to remove under memory pressure. LRU is simple but can be polluted by scans; LFU protects frequently accessed keys; production caches often use approximations like segmented LRU, TinyLFU, or admission policies to handle large scale efficiently.
- Hot-key mitigation matters at Google scale. A single celebrity video, search result, or config key can overwhelm one cache shard. Use replication, key salting for read-heavy values, hierarchical fanout, request collapsing, and adaptive load shedding.
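- Consistent hashing limits key movement when cache nodes are added or removed. With $N$ nodes, naive modulo hashing (key hash mod $N$) may remap most keys on a membership change, while consistent hashing remaps only about $1/N$ of them; virtual nodes keep that fraction close to the ideal and spread the moved keys across many nodes rather than one neighbor.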
- Negative caching stores misses, such as 404 responses or absent database rows, to prevent repeated expensive lookups. It needs short TTLs and careful invalidation because nonexistent objects may be created soon after.
- Observability should include p50, p95, and p99 latency, hit ratio, origin request rate, eviction rate, stale-serve count, error rate, cache memory utilization, and per-key hotness. A cache that improves average latency can still worsen p99 through backend miss storms.
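The cache-aside bullet mentions request coalescing as the fix for concurrent misses. A minimal in-process sketch follows, assuming a plain dict stands in for the cache and a hypothetical load_from_origin callable stands in for the source of truth; the class name CoalescingCacheAside is illustrative. The first miss on a key becomes the leader and calls the origin, while later callers wait on a threading.Event and then re-read the cache.

```python
import threading
from typing import Any, Callable, Dict


class CoalescingCacheAside:
    """Cache-aside reads where concurrent misses on one key share a single origin load."""

    def __init__(self, load_from_origin: Callable[[Any], Any]) -> None:
        self._load = load_from_origin                  # hypothetical source-of-truth fetch
        self._cache: Dict[Any, Any] = {}               # stand-in for the real cache
        self._inflight: Dict[Any, threading.Event] = {}
        self._lock = threading.Lock()

    def get(self, key: Any) -> Any:
        while True:
            with self._lock:
                if key in self._cache:
                    return self._cache[key]            # cache hit
                event = self._inflight.get(key)
                leader = event is None
                if leader:                             # first miss: this caller loads
                    event = threading.Event()
                    self._inflight[key] = event
            if not leader:
                event.wait()                           # follower: wait, then re-check the cache
                continue
            try:
                value = self._load(key)                # one origin call per burst of misses
                with self._lock:
                    self._cache[key] = value           # populate before releasing followers
                return value
            finally:
                with self._lock:
                    del self._inflight[key]
                event.set()                            # wake followers; on failure one of them retries
```

This only coalesces within one process. Across many application servers, the same idea usually needs a shared mechanism such as a short-lived lock or lease in the cache itself.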
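To make the traffic-weighting point from the hit-ratio bullet concrete, the sketch below computes three ratios over one request log. It assumes "object hit ratio" means the fraction of distinct keys that saw at least one hit, which is one common reading; the Request record and hit_ratios function are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    key: str
    size_bytes: int
    hit: bool


def hit_ratios(log: List[Request]) -> Dict[str, float]:
    # Request hit ratio: share of requests served from cache (origin request savings).
    request_ratio = sum(r.hit for r in log) / len(log)
    # Byte hit ratio: share of bytes served from cache (origin bandwidth savings).
    byte_ratio = sum(r.size_bytes for r in log if r.hit) / sum(r.size_bytes for r in log)
    # Object hit ratio (assumed definition): share of distinct keys with at least one hit,
    # which ignores how much traffic each key actually carries.
    keys = {r.key for r in log}
    hit_keys = {r.key for r in log if r.hit}
    object_ratio = len(hit_keys) / len(keys)
    return {"object": object_ratio, "request": request_ratio, "byte": byte_ratio}
```

For example, a log with one 1,000-byte hot key requested 91 times (one miss, then hits) and nine 10,000-byte keys requested once each (all misses) scores 0.9 by requests, 0.1 by objects, and about 0.5 by bytes, even though the traffic is identical.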
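The stale-while-revalidate bullet can be sketched as an entry with two deadlines: a freshness deadline (the TTL) and a later stale deadline. Between them, reads return the old value immediately and at most one background thread per key refreshes it; past the stale deadline, the read blocks on the origin. The SWRCache name, the load callable, and the injectable clock are assumptions for illustration.

```python
import threading
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Set


@dataclass
class Entry:
    value: Any
    fresh_until: float   # end of the TTL
    stale_until: float   # end of the stale-while-revalidate window


class SWRCache:
    def __init__(self, load: Callable[[str], Any], ttl: float, swr: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load, self._ttl, self._swr, self._clock = load, ttl, swr, clock
        self._entries: Dict[str, Entry] = {}
        self._refreshing: Set[str] = set()
        self._lock = threading.Lock()

    def _store(self, key: str, value: Any) -> None:
        now = self._clock()
        with self._lock:
            self._entries[key] = Entry(value, now + self._ttl, now + self._ttl + self._swr)

    def _refresh(self, key: str) -> None:
        try:
            self._store(key, self._load(key))
        finally:
            with self._lock:
                self._refreshing.discard(key)

    def get(self, key: str) -> Any:
        now = self._clock()
        with self._lock:
            entry = self._entries.get(key)
            if entry and now < entry.fresh_until:
                return entry.value                          # fresh hit
            if entry and now < entry.stale_until:
                if key not in self._refreshing:             # one background refresh per key
                    self._refreshing.add(key)
                    threading.Thread(target=self._refresh, args=(key,), daemon=True).start()
                return entry.value                          # serve stale without waiting
        value = self._load(key)                             # missing or too stale: block on origin
        self._store(key, value)
        return value
```

The stale window bounds how old a served value can get (TTL plus window), which is why the bullet limits this pattern to data where that bound is acceptable.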
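A plain LRU built on collections.OrderedDict shows the baseline policy from the eviction bullet and its scan weakness. This is not segmented LRU or TinyLFU, which add frequency tracking or admission control on top; the LRUCache class is illustrative.

```python
from collections import OrderedDict
from typing import Any


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._data: "OrderedDict[Any, Any]" = OrderedDict()

    def get(self, key: Any, default: Any = None) -> Any:
        if key not in self._data:
            return default
        self._data.move_to_end(key)          # mark as most recently used
        return self._data[key]

    def put(self, key: Any, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict the least recently used entry
```

With capacity 3, inserting one hot key and then a single pass over three cold keys evicts the hot key. That is the scan pollution that frequency-aware admission policies are designed to prevent.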
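Key salting from the hot-key bullet can be sketched as follows: a hot key is stored under several suffixed copies that hash to different shards, each read picks one copy at random, and every write or invalidation must update all copies. The "#i" suffix scheme, the CRC32 shard function, and all function names are assumptions for illustration.

```python
import random
import zlib
from collections import Counter
from typing import List


def shard_for(key: str, num_shards: int) -> int:
    # Stable hash so every client maps the same key to the same shard.
    return zlib.crc32(key.encode()) % num_shards


def salted_copies(key: str, copies: int) -> List[str]:
    # Hypothetical naming scheme "<key>#0" .. "<key>#<copies-1>".
    # Writes and invalidations must update every copy in this list.
    return [f"{key}#{i}" for i in range(copies)]


def salted_read_key(key: str, copies: int, rng: random.Random) -> str:
    # Each read picks one copy at random, spreading the hot key over several shards.
    return f"{key}#{rng.randrange(copies)}"


def shard_load(key: str, copies: int, reads: int, num_shards: int, seed: int = 0) -> Counter:
    # Simulate read traffic for one hot key and count requests landing on each shard.
    rng = random.Random(seed)
    return Counter(shard_for(salted_read_key(key, copies, rng), num_shards) for _ in range(reads))
```

The cost of salting is write amplification and a wider invalidation window, so it suits read-heavy values, as the bullet says.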
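The key-movement claim in the consistent-hashing bullet can be checked with a small simulation. The sketch places 100 virtual nodes per server on a 64-bit ring derived from MD5 (an arbitrary choice of well-mixed hash) and compares how many keys move when an eleventh node joins ten, versus plain modulo hashing. HashRing, modulo_node, and moved_fraction are illustrative names.

```python
import bisect
import hashlib
from typing import Callable, List


def _h(s: str) -> int:
    # 64-bit ring position taken from the first 8 bytes of an MD5 digest.
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes: List[str], vnodes: int = 100) -> None:
        # Each physical node appears at `vnodes` positions on the ring.
        self._ring = sorted((_h(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))
        self._points = [p for p, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First virtual node clockwise from the key's position, wrapping past the end.
        i = bisect.bisect(self._points, _h(key)) % len(self._points)
        return self._ring[i][1]


def modulo_node(key: str, nodes: List[str]) -> str:
    return nodes[_h(key) % len(nodes)]


def moved_fraction(keys: List[str], before: Callable[[str], str],
                   after: Callable[[str], str]) -> float:
    return sum(before(k) != after(k) for k in keys) / len(keys)
```

Growing from 10 to 11 nodes should move roughly 1/11 of keys on the ring, all of them onto the new node, while modulo hashing moves roughly 10/11, because a key keeps its node only when its hash leaves the same remainder mod 10 and mod 11.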
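Negative caching can be sketched by storing a sentinel for confirmed misses with a much shorter TTL than positive entries, plus a hook that clears the sentinel when the object is created. The fetch callable (assumed to return None for absent rows), the on_create hook, and the TTL values are hypothetical.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

_MISSING = object()  # sentinel meaning "the origin confirmed this key does not exist"


class NegativeCachingLookup:
    def __init__(self, fetch: Callable[[str], Optional[Any]], ttl: float, negative_ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch, self._ttl, self._neg_ttl, self._clock = fetch, ttl, negative_ttl, clock
        self._cache: Dict[str, Tuple[Any, float]] = {}   # key -> (value or sentinel, expiry)

    def get(self, key: str) -> Optional[Any]:
        now = self._clock()
        cached = self._cache.get(key)
        if cached and now < cached[1]:
            return None if cached[0] is _MISSING else cached[0]
        value = self._fetch(key)                           # hypothetical origin lookup
        if value is None:
            self._cache[key] = (_MISSING, now + self._neg_ttl)   # short TTL for misses
        else:
            self._cache[key] = (value, now + self._ttl)
        return value

    def on_create(self, key: str) -> None:
        # Drop any cached "does not exist" marker as soon as the object is created.
        self._cache.pop(key, None)
```

If the create path cannot reliably call on_create (for example, a different service writes the row), the short negative TTL is the only bound on how long the object appears missing.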
Worked example
For the prompt “Design a global CDN for static content,” a strong candidate first clarifies content type, object size distribution, update frequency, target latency, consistency expectations, and traffic scale. They might state assumptions like “static images, videos, CSS, and JavaScript assets; mostly read-heavy; acceptable staleness is seconds to minutes except for purge requests.” The answer should be organized around four pillars: request routing, cache hierarchy, invalidation/freshness, and failure handling.
For request routing, describe DNS or anycast-based routing to the nearest healthy edge, with fallback to another point of presence if the local edge is overloaded. For cache hierarchy, place objects in browser cache, edge cache, regional cache, and origin storage, so edge misses do not always hit the origin. For freshness, use Cache-Control, ETag, versioned URLs such as /app.abcd1234.js, and explicit purge APIs for emergency rollback. For failure handling, explain that edge nodes should keep serving stale copies of cacheable assets when the origin is unavailable, while respecting security-sensitive headers and private content boundaries.
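The freshness piece of this answer can be sketched in a few lines: derive a content-hashed file name in the style of /app.abcd1234.js and attach Cache-Control and ETag headers. The directive values (a one-year max-age with immutable for hashed assets, a 60-second max-age with stale-if-error for unversioned entry points) are illustrative choices, not recommendations from this guide; immutable and stale-if-error are standard extension directives that not every cache honors.

```python
import hashlib
from pathlib import PurePosixPath


def versioned_name(path: str, content: bytes, digest_len: int = 8) -> str:
    # Embed a content hash in the file name so every new build gets a new URL.
    p = PurePosixPath(path)
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    return str(p.with_name(f"{p.stem}.{digest}{p.suffix}"))


def cache_headers(content: bytes, versioned: bool) -> dict:
    # Strong validator derived from the bytes, usable for If-None-Match revalidation.
    etag = '"' + hashlib.sha256(content).hexdigest()[:16] + '"'
    if versioned:
        # Bytes at a hashed URL never change, so caches may keep them for a long time.
        cc = "public, max-age=31536000, immutable"
    else:
        # Unversioned entry points (e.g. index.html): short TTL, serve stale if origin errors.
        cc = "public, max-age=60, stale-if-error=86400"
    return {"Cache-Control": cc, "ETag": etag}
```

Because a new build yields a new name, a deploy never has to purge the old asset URL; only the small unversioned entry point that references it needs a short TTL or a purge.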
One tradeoff to flag explicitly is purge speed versus system cost: globally synchronous invalidation gives stronger guarantees but can be slow and expensive, while versioned object names avoid most purge complexity. A good close is: “If I had more time, I would discuss abuse prevention, multi-tenant cache isolation, regional capacity planning, and detailed metrics like edge hit ratio and origin offload.”
A second angle
For the prompt “Design a distributed cache for a web service,” the same concepts apply, but the constraints shift from public edge delivery to backend correctness and dependency management. Instead of Cache-Control headers and CDN purges, you would focus on cache-aside reads, database write paths, TTL selection, and invalidation after mutations. The key design question becomes whether stale data is acceptable for each entity type: user profile metadata may tolerate seconds of staleness, while permissions or billing state may not. You would also discuss sharding with consistent hashing, replication for hot keys, and graceful degradation when Memcached or Redis is slow. The interviewer is looking for whether you know when not to cache, not just how to add a cache.
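A minimal sketch of that backend write path, assuming plain dicts stand in for the database and for Memcached or Redis, and a hypothetical per-entity TTL_POLICY where None means "never cache". Reads follow cache-aside; writes commit to the source of truth first and then delete the cached copy rather than overwriting it.

```python
from typing import Any, Dict, Optional

# Hypothetical policy: tolerated staleness in seconds per entity type; None = do not cache.
TTL_POLICY: Dict[str, Optional[int]] = {
    "profile": 30,        # profile metadata tolerates brief staleness
    "permissions": None,  # always read from the source of truth
    "billing": None,
}


class EntityStore:
    """Cache-aside reads plus invalidate-after-write over stand-in stores."""

    def __init__(self) -> None:
        self.db: Dict[str, Any] = {}        # durable source of truth (stand-in)
        self.cache: Dict[str, Any] = {}     # derived copies only; TTL expiry omitted here

    @staticmethod
    def _kind(key: str) -> str:
        return key.split(":", 1)[0]         # keys look like "profile:42"

    def read(self, key: str) -> Any:
        if TTL_POLICY.get(self._kind(key)) is None:
            return self.db.get(key)         # uncacheable or unknown entity: bypass the cache
        if key in self.cache:
            return self.cache[key]
        value = self.db.get(key)
        self.cache[key] = value             # a real cache would set TTL_POLICY[kind] here
        return value

    def write(self, key: str, value: Any) -> None:
        self.db[key] = value                # 1. commit to the source of truth
        self.cache.pop(key, None)           # 2. then invalidate (delete, not overwrite)
```

Deleting instead of writing the new value keeps the cache a derived copy, but a slow reader can still repopulate an old value just after the delete, so designs often keep a TTL as a backstop or use lease-style tokens to reject late fills.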
Common pitfalls
Pitfall: Treating the cache as the source of truth.
A tempting but weak answer says, “Write the data to the cache and read from there for speed.” A stronger answer clearly separates the durable source of truth, such as a database or object store, from derived cached copies, and explains how cache entries are regenerated or invalidated.
Pitfall: Ignoring stampedes and tail latency.
Many candidates mention TTLs but miss what happens when a popular key expires and thousands of requests miss simultaneously. Better answers include request coalescing, jittered TTLs, background refresh, stale-while-revalidate, and rate limits to protect the origin.
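Jittered TTLs can be as simple as scaling each TTL by a random factor, so keys populated in the same burst do not all expire in the same instant. The function name jittered_ttl and the default 10% spread are illustrative assumptions.

```python
import random
from typing import Optional


def jittered_ttl(base_ttl: float, jitter: float = 0.1, rng: Optional[random.Random] = None) -> float:
    # Scale the TTL by a random factor in [1 - jitter, 1 + jitter] so entries written
    # together expire at different moments instead of missing all at once.
    r = rng if rng is not None else random
    return base_ttl * (1.0 + r.uniform(-jitter, jitter))
```

Jitter spreads expirations across many keys but does not stop a single hot key from stampeding when it expires; that still needs request coalescing or background refresh.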
Pitfall: Giving one universal consistency answer.
Saying “we need strong consistency” or “eventual consistency is fine” without classifying data is too shallow. Interviewers expect you to distinguish static assets, public metadata, personalized content, permissions, and financial data, then assign different caching and invalidation strategies.
Connections
Interviewers may pivot from caching into load balancing, replication, distributed consensus, object storage, or rate limiting. They may also ask you to debug a production incident involving elevated p99, low hit ratio, regional failover, or an origin overload event after a bad TTL change.
Further reading
- Caching at Scale with Facebook’s Mcrouter: practical design ideas for large Memcached deployments, routing, and failover.
- Google SRE Book, Handling Overload: useful for understanding cache miss storms, load shedding, and protecting backend services.
- RFC 9111 (HTTP Caching): authoritative reference for Cache-Control, validators, freshness, and HTTP cache behavior.
Related concepts
- Google-Scale Search Indexing and Ranking
- Caching And Stateful Data Structure Design (Coding & Algorithms)
- Storage, Indexing, APIs, And Secure Execution (System Design)
- Distributed Systems Consistency, Reliability, And Observability (System Design)
- LLM Inference Serving, Batching, And KV Cache
- LLM Serving, Inference Scaling, KV Cache, and Latency-Cost Tradeoffs