What difficulty level is this interview question?

This is a medium difficulty System Design question, commonly asked during Onsite rounds at Uber.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Uber during technical interviews.

Design a Distributed Logging System | Uber Interview Question

This question evaluates a candidate's ability to architect a large-scale distributed logging system, testing competency in data ingestion pipelines, fault tolerance, and storage tiering. It assesses practical system design skills around scalability, durability guarantees, and latency trade-offs common in senior software engineering interviews.

Design a Distributed Logging System

You are asked to design a distributed logging system for a large microservices platform (think of a ride-hailing company running thousands of service instances across many regions). Every service emits log lines — structured events such as request traces, errors, warnings, and audit records — and these logs must be reliably collected, transported, stored, indexed, and made searchable for engineers and on-call responders.

The system should ingest logs from tens of thousands of hosts, survive bursts and partial outages without losing data it has acknowledged, let an engineer search recent logs in seconds, and retain older logs cheaply for compliance. Walk through the end-to-end architecture: how a log line travels from an application process to a searchable index, how the system scales and stays available, and how you keep cost under control.

Constraints & Assumptions

Use these as working numbers; state any you change.

Scale: ~50,000 service instances across 5 regions; aggregate steady-state ingest of ~2,000,000 log events/sec, with peaks up to 4x during incidents.
Event size: average ~500 bytes per structured event after serialization, so ~1 GB/sec steady-state (~86 TB/day) before compression.
Latency targets: a log line should be searchable within ~30 seconds of emission (p99). Search queries over the last 24 hours should return in a few seconds.
Retention: hot/searchable retention 7–14 days; warm/archival retention up to 1 year for a subset (audit/compliance), stored cheaply.
Durability: once the collection tier acknowledges a batch, it must not be silently lost (at-least-once delivery is acceptable; duplicates must be tolerable downstream).
Availability: ingest must keep accepting logs during single-AZ or single-broker failures; a regional outage may degrade search for that region but must not lose acknowledged data.
Multi-tenancy: many teams share the system; one noisy service must not starve others.
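The working numbers above can be checked with a few lines of arithmetic. The sketch below reproduces the steady-state and peak bandwidth and the raw daily volume, then extends them to a rough hot-tier footprint. The compression ratio and replication factor are assumptions made for illustration; they are not given in the prompt and should be stated as such in an interview.

```python
# Back-of-envelope sizing from the stated constraints.
EVENTS_PER_SEC = 2_000_000      # steady-state ingest (given)
AVG_EVENT_BYTES = 500           # average serialized event size (given)
PEAK_MULTIPLIER = 4             # incident peak (given)
HOT_RETENTION_DAYS = (7, 14)    # hot/searchable retention range (given)
SECONDS_PER_DAY = 86_400

# Assumptions for illustration only; the prompt does not specify them.
ASSUMED_COMPRESSION_RATIO = 5   # raw bytes : stored bytes
ASSUMED_REPLICATION_FACTOR = 3  # copies kept in the hot tier


def sizing() -> dict:
    steady_bytes_per_sec = EVENTS_PER_SEC * AVG_EVENT_BYTES
    raw_bytes_per_day = steady_bytes_per_sec * SECONDS_PER_DAY
    hot_raw_tb = [raw_bytes_per_day * d / 1e12 for d in HOT_RETENTION_DAYS]
    # Stored size ignores index overhead, which can be substantial.
    hot_stored_tb = [
        tb * ASSUMED_REPLICATION_FACTOR / ASSUMED_COMPRESSION_RATIO
        for tb in hot_raw_tb
    ]
    return {
        "steady_gb_per_sec": steady_bytes_per_sec / 1e9,
        "peak_gb_per_sec": steady_bytes_per_sec * PEAK_MULTIPLIER / 1e9,
        "raw_tb_per_day": raw_bytes_per_day / 1e12,
        "hot_raw_tb": hot_raw_tb,
        "hot_stored_tb": hot_stored_tb,
    }


if __name__ == "__main__":
    for name, value in sizing().items():
        print(name, value)
```

The raw figures (1 GB/sec steady, 4 GB/sec peak, 86.4 TB/day) follow directly from the constraints and match the daily volume quoted above to within rounding. The hot-tier totals move linearly with the two assumed ratios, so they are the first numbers to revisit if the interviewer changes an assumption.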

Clarifying Questions to Ask

What is the read/write ratio and query pattern — mostly recent tail-following and incident debugging (high write, bursty point/range reads), or heavy analytical aggregation over long windows?
What delivery guarantee does the business need: best-effort (drop under extreme load), at-least-once (durable, dedup downstream), or exactly-once (much costlier)?
Are logs structured (JSON/protobuf with known fields) or arbitrary free text, and do we control the client logging library so we can enforce a schema and sampling?
What are the compliance/PII requirements — must certain fields be redacted at the edge, encrypted at rest, access-controlled, and provably retained/deleted on a schedule?
What is the cost envelope and how do we trade off hot-index size vs. cheaper object-store archival vs. sampling/aggregation?
Who are the consumers besides humans — alerting/anomaly-detection pipelines, metrics derivation, security/SIEM — and do they need the raw stream or aggregates?
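On the delivery-guarantee question: if the answer is at-least-once, duplicates have to be harmless downstream. One common way to get there is to give every event a deterministic ID and make the write idempotent on that ID. The sketch below assumes each agent stamps events with its host name and a per-agent sequence number; `event_id` and `IdempotentSink` are hypothetical names, and the in-memory dict stands in for whatever store the indexing tier actually uses.

```python
import hashlib
from typing import Dict


def event_id(host: str, agent_seq: int) -> str:
    """Deterministic ID: the same (host, sequence) pair always hashes the same."""
    return hashlib.sha256(f"{host}:{agent_seq}".encode("utf-8")).hexdigest()


class IdempotentSink:
    """Hypothetical sink keyed by event ID, so a redelivered event overwrites itself."""

    def __init__(self) -> None:
        self.docs: Dict[str, str] = {}

    def write(self, host: str, agent_seq: int, payload: str) -> bool:
        eid = event_id(host, agent_seq)
        is_new = eid not in self.docs
        self.docs[eid] = payload  # upsert: a duplicate replaces an identical doc
        return is_new


if __name__ == "__main__":
    sink = IdempotentSink()
    print(sink.write("host-1", 42, "payment failed"))  # first delivery
    print(sink.write("host-1", 42, "payment failed"))  # retry after a lost ack
    print(len(sink.docs))
```

This gives effectively-once results for search without paying for exactly-once transport, provided the sequence number survives agent restarts. That persistence is itself an assumption to raise with the interviewer.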

Part 1 — Ingestion and transport (the write path)

Design how a log line gets from an application process to a durable buffer that decouples producers from storage. Cover the on-host agent, batching/back-pressure, the transport/broker tier, partitioning, and the durability guarantee at acknowledgment time.
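To make the agent's batching and back-pressure behavior concrete, here is a minimal sketch of one possible on-host agent. It is an illustration, not a reference design: `BatchingAgent`, its parameters, and the `send` callback (which returns `True` only once the broker tier has acknowledged the batch) are all hypothetical. Lines are removed from the buffer only after an acknowledgment, which is what makes delivery at-least-once; when the bounded buffer is full the agent refuses new lines and counts the drop, rather than blocking the application.

```python
import itertools
import time
from collections import deque
from typing import Callable, Deque, List, Optional


class BatchingAgent:
    """Hypothetical on-host agent: bounded buffer, batched sends, retry until acked."""

    def __init__(self, send: Callable[[List[str]], bool], max_batch: int = 500,
                 max_buffer: int = 10_000, max_age_s: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.send = send              # returns True only when the batch is acked
        self.max_batch = max_batch
        self.max_buffer = max_buffer
        self.max_age_s = max_age_s
        self.clock = clock
        self.buffer: Deque[str] = deque()
        self.oldest_ts: Optional[float] = None
        self.dropped = 0

    def emit(self, line: str) -> bool:
        """Accept a line, or refuse it (back-pressure) when the buffer is full."""
        if len(self.buffer) >= self.max_buffer:
            self.dropped += 1
            return False
        if not self.buffer:
            self.oldest_ts = self.clock()
        self.buffer.append(line)
        if len(self.buffer) >= self.max_batch:
            self.flush()
        return True

    def tick(self) -> None:
        """Called periodically: flush a partial batch once it is old enough."""
        if self.buffer and self.clock() - self.oldest_ts >= self.max_age_s:
            self.flush()

    def flush(self) -> bool:
        """Send batches in order; stop at the first unacknowledged batch."""
        while self.buffer:
            batch = list(itertools.islice(self.buffer, self.max_batch))
            if not self.send(batch):
                return False          # keep lines buffered and retry later
            for _ in batch:
                self.buffer.popleft() # drop from the buffer only after the ack
        self.oldest_ts = None
        return True
```

A production agent would typically spill to local disk before dropping, add jittered retry back-off instead of retrying on every `emit`, and shed low-severity lines first. Those choices are worth naming in the answer even though the sketch omits them.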

Part 2 — Storage, indexing, and tiering

Design how logs are consumed from the buffer, indexed for fast search, and tiered from hot search storage to cheap long-term archive. Cover the indexing layer, the data model / time partitioning, and the lifecycle from hot to warm to cold.

Clarifying Questions for this Part

Do queries need full-text search over message bodies, or is filtering by structured fields (service, level, trace_id, time range) sufficient for most use cases?
What fraction of ingested volume must be searchable in the hot tier vs. only retrievable from archive on demand?
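The time partitioning and tier lifecycle this part asks about could be sketched as two small functions: one that maps an event to a time-bucketed partition, and one that decides which tier an event belongs to given its age. The hourly bucket, the partition naming scheme, the tier names, and the `ARCHIVED_CLASSES` set are assumptions chosen for illustration; the retention limits reuse the upper bounds from the constraints (14 days hot, 1 year archive for the audit/compliance subset).

```python
from datetime import datetime, timedelta, timezone

HOT_DAYS = 14                 # upper bound of the hot retention range (given)
ARCHIVE_DAYS = 365            # archival retention for the compliance subset (given)
ARCHIVED_CLASSES = {"audit"}  # hypothetical: which log classes get archived


def partition_name(service: str, ts: datetime) -> str:
    """Hourly, per-service partition so old data can be dropped a whole bucket at a time."""
    utc = ts.astimezone(timezone.utc)
    return "logs-{}-{}".format(service, utc.strftime("%Y.%m.%d.%H"))


def tier_for(event_ts: datetime, now: datetime, log_class: str) -> str:
    """Return 'hot', 'archive', or 'expired' for an event of the given age and class."""
    age = now - event_ts
    if age <= timedelta(days=HOT_DAYS):
        return "hot"
    if log_class in ARCHIVED_CLASSES and age <= timedelta(days=ARCHIVE_DAYS):
        return "archive"
    return "expired"


if __name__ == "__main__":
    now = datetime(2024, 3, 20, tzinfo=timezone.utc)
    ts = datetime(2024, 3, 5, 17, 42, tzinfo=timezone.utc)
    print(partition_name("payments", ts))
    print(tier_for(ts, now, "debug"))
```

Partitioning by time means retention is enforced by deleting or moving whole partitions rather than individual documents, and a query with a time range only touches the partitions that overlap it.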

Part 3 — Scale, availability, and operations

Show how the design scales horizontally, stays available under failures, prevents one tenant from harming others, and is observable. Cover regional topology, hot-partition/back-pressure handling, multi-tenancy isolation, and how you monitor the pipeline itself.
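One way to keep a noisy tenant from starving others is a per-tenant token bucket at the ingest edge. The sketch below is a minimal single-process version under stated assumptions: `TokenBucket` and `TenantLimiter` are hypothetical names, every tenant gets the same default rate and burst, and time is passed in explicitly so the behavior is easy to reason about. A real deployment would need per-tenant quotas and some coordination across ingest nodes.

```python
from typing import Dict


class TokenBucket:
    """Classic token bucket: refills at `rate` tokens/sec up to `burst`."""

    def __init__(self, rate: float, burst: float, now: float) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.last = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantLimiter:
    """Hypothetical per-tenant limiter: each tenant draws from its own bucket."""

    def __init__(self, rate: float, burst: float) -> None:
        self.rate = rate
        self.burst = burst
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, tenant: str, now: float) -> bool:
        bucket = self.buckets.get(tenant)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.burst, now)
            self.buckets[tenant] = bucket
        return bucket.allow(now)


if __name__ == "__main__":
    limiter = TenantLimiter(rate=2.0, burst=3.0)
    print([limiter.allow("noisy", 0.0) for _ in range(5)])
    print(limiter.allow("quiet", 0.0))
```

Because each tenant has its own bucket, exhausting one bucket has no effect on another tenant's admission. What happens to rejected events (drop, sample, or divert to a lower-priority queue) is a separate policy decision.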

Follow-up Questions

A single high-cardinality field (e.g. a unique request URL with embedded IDs) is blowing up your index size and slowing queries. How do you detect this and what are your options (field exclusion, normalization, sampling, dedicated index)?
An engineer needs to trace a single request across 30 services. What must be present in the logging schema and the transport path for this to work, and how does it interact with sampling?
A region goes fully offline for 2 hours and comes back. Walk through what happens to the logs produced there during the outage and how (or whether) they become searchable afterward. (References Part 1's durability guarantee and Part 3's regional topology.)
The business asks to cut logging cost by 40% without losing the ability to debug production incidents. What concrete levers do you pull, and how do you measure that debugging capability is preserved?
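For the tracing and cost follow-ups, one lever that keeps cross-service traces intact is to sample on the trace ID rather than per log line. The sketch below assumes every event carries a `trace_id` propagated across services and a severity level; `keep_event` and `ALWAYS_KEEP_LEVELS` are hypothetical. Because the decision is a pure function of the trace ID, all services reach the same verdict for a given request without coordinating, and errors bypass sampling entirely.

```python
import hashlib

ALWAYS_KEEP_LEVELS = {"ERROR", "FATAL"}  # hypothetical: never sample these out


def keep_event(trace_id: str, level: str, sample_rate: float) -> bool:
    """Keep an event if its level is critical or its trace falls in the sampled fraction."""
    if level in ALWAYS_KEEP_LEVELS:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # uniform in [0, 2**64)
    threshold = int(sample_rate * 2**64)         # integer compare avoids float rounding
    return bucket < threshold


if __name__ == "__main__":
    trace = "trace-abc-123"
    # Two different services evaluating the same trace get the same answer.
    print(keep_event(trace, "INFO", 0.1) == keep_event(trace, "INFO", 0.1))
    print(keep_event(trace, "ERROR", 0.0))
```

A kept trace therefore has its logs from every service and a dropped trace has none beyond errors, so the cost saving does not leave partially logged requests. Measuring whether debugging capability is preserved still needs a separate check, for example replaying past incidents against the sampled data.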