Design a Distributed Logging System

Last updated: Jun 24, 2026

Quick Overview

This question evaluates a candidate's ability to architect a large-scale distributed logging system, testing competency in data ingestion pipelines, fault tolerance, and storage tiering. It assesses practical system design skills around scalability, durability guarantees, and latency trade-offs common in senior software engineering interviews.

Company: Uber

Role: Software Engineer

Category: System Design

Difficulty: medium

Interview Round: Onsite

Hints

  • Part 1 (decoupling): A buffering tier between producers and the indexing/storage backend is the key idea. It absorbs bursts, lets storage fall behind temporarily, and turns durability into "ack means persisted to a replicated log." Think about what technology gives you a partitioned, replicated, append-only log.
  • Part 1 (back-pressure and loss): Decide what happens when the buffer is full or a downstream is down: block, spill to a local on-disk queue on the host, sample/drop by priority, or shed load. Tie the chosen behavior back to the delivery guarantee you committed to.
  • Part 2 (two storage roles): Separate the "searchable hot index" (an inverted-index search engine sized for days of data) from the "cheap durable archive" (compressed objects in object storage). Consumers fan the broker stream into both. Time-based indices/partitions make retention and rollover a metadata operation.
  • Part 2 (index cost): Indexing every field is expensive at this volume. Consider indexing a curated set of fields (timestamp, service, level, trace_id, host) plus full-text on the message, sampling low-value debug logs, and rolling indices by time so old ones can be force-merged, frozen, or deleted cheaply.
  • Part 3 (don't let the logging system page on itself): The pipeline must be observable without depending on the very index it feeds. Track per-stage lag (broker consumer lag, indexing latency), drop counters, and per-tenant ingest rates with an independent metrics path so a logging outage is visible.

Jun 8, 2026, 12:00 AM

Design a Distributed Logging System

You are asked to design a distributed logging system for a large microservices platform (think of a ride-hailing company running thousands of service instances across many regions). Every service emits log lines — structured events such as request traces, errors, warnings, and audit records — and these logs must be reliably collected, transported, stored, indexed, and made searchable for engineers and on-call responders.

The system should ingest logs from tens of thousands of hosts, survive bursts and partial outages without losing data it has acknowledged, let an engineer search recent logs in seconds, and retain older logs cheaply for compliance. Walk through the end-to-end architecture: how a log line travels from an application process to a searchable index, how the system scales and stays available, and how you keep cost under control.

Constraints & Assumptions

Use these as working numbers; state any you change.

  • Scale: ~50,000 service instances across 5 regions; aggregate steady-state ingest of ~2,000,000 log events/sec, with peaks up to 4x during incidents.
  • Event size: average ~500 bytes per structured event after serialization, so ~1 GB/sec steady-state (~85 TB/day) before compression.
  • Latency targets: a log line should be searchable within ~30 seconds of emission (p99). Search queries over the last 24 hours should return in a few seconds.
  • Retention: hot/searchable retention 7–14 days; warm/archival retention up to 1 year for a subset (audit/compliance), stored cheaply.
  • Durability: once the collection tier acknowledges a batch, it must not be silently lost (at-least-once delivery is acceptable; duplicates must be tolerable downstream).
  • Availability: ingest must keep accepting logs during single-AZ or single-broker failures; a regional outage may degrade search for that region but must not lose acknowledged data.
  • Multi-tenancy: many teams share the system; one noisy service must not starve others.
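
The working numbers above can be sanity-checked with a few lines of arithmetic. The sketch below is only a back-of-envelope aid: the compression ratio and replication factor are assumptions introduced for illustration and do not come from the question.

```python
# Back-of-envelope sizing derived from the stated constraints.
EVENTS_PER_SEC = 2_000_000   # steady-state ingest (from the constraints)
EVENT_BYTES = 500            # average serialized event size (from the constraints)
PEAK_MULTIPLIER = 4          # incident peak (from the constraints)
HOT_RETENTION_DAYS = 14      # upper bound of the hot window (from the constraints)

# Assumptions for illustration only; the question does not specify these.
COMPRESSION_RATIO = 5        # hypothetical 5:1 compression of structured logs
REPLICATION_FACTOR = 3       # hypothetical replica count in the buffer tier

SECONDS_PER_DAY = 86_400
GB, TB, PB = 10**9, 10**12, 10**15

steady_bytes_per_sec = EVENTS_PER_SEC * EVENT_BYTES
peak_bytes_per_sec = steady_bytes_per_sec * PEAK_MULTIPLIER
raw_bytes_per_day = steady_bytes_per_sec * SECONDS_PER_DAY
hot_raw_bytes = raw_bytes_per_day * HOT_RETENTION_DAYS
hot_compressed_bytes = hot_raw_bytes / COMPRESSION_RATIO
buffer_write_bytes_per_sec = steady_bytes_per_sec * REPLICATION_FACTOR

print(f"steady ingest:        {steady_bytes_per_sec / GB:.1f} GB/s")
print(f"peak ingest:          {peak_bytes_per_sec / GB:.1f} GB/s")
print(f"raw volume per day:   {raw_bytes_per_day / TB:.1f} TB")
print(f"hot window, raw:      {hot_raw_bytes / PB:.2f} PB")
print(f"hot window, assumed compression: {hot_compressed_bytes / PB:.2f} PB")
print(f"buffer writes with assumed replication: {buffer_write_bytes_per_sec / GB:.1f} GB/s")
```

The daily raw volume works out to about 86 TB, in line with the rounded figure in the constraints. The point of stating the peak and the hot-window size early is that they drive the broker and index sizing in Parts 1 and 2.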

Clarifying Questions to Ask

  • What is the read/write ratio and query pattern — mostly recent tail-following and incident debugging (high write, bursty point/range reads), or heavy analytical aggregation over long windows?
  • What delivery guarantee does the business need: best-effort (drop under extreme load), at-least-once (durable, dedup downstream), or exactly-once (much costlier)?
  • Are logs structured (JSON/protobuf with known fields) or arbitrary free text, and do we control the client logging library so we can enforce a schema and sampling?
  • What are the compliance/PII requirements — must certain fields be redacted at the edge, encrypted at rest, access-controlled, and provably retained/deleted on a schedule?
  • What is the cost envelope and how do we trade off hot-index size vs. cheaper object-store archival vs. sampling/aggregation?
  • Who are the consumers besides humans — alerting/anomaly-detection pipelines, metrics derivation, security/SIEM — and do they need the raw stream or aggregates?
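
If the answer to the delivery-guarantee question is at-least-once, downstream consumers have to tolerate duplicates. One possible approach, sketched below, is a bounded window of recently seen event IDs. It assumes every event carries a unique `event_id` assigned at the source, which the question does not state; the class name and window size are hypothetical.

```python
from collections import OrderedDict


class Deduplicator:
    """Bounded LRU window of recently seen event IDs (illustrative sketch)."""

    def __init__(self, max_ids: int = 100_000) -> None:
        self._seen = OrderedDict()   # event_id -> None, oldest first
        self._max_ids = max_ids

    def is_new(self, event_id: str) -> bool:
        """Return True the first time an ID is seen within the window."""
        if event_id in self._seen:
            self._seen.move_to_end(event_id)   # refresh recency
            return False
        self._seen[event_id] = None
        if len(self._seen) > self._max_ids:
            self._seen.popitem(last=False)     # evict the oldest ID
        return True


# Usage: a redelivered batch containing a repeated event.
dedup = Deduplicator(max_ids=1000)
delivered = ["e1", "e2", "e2", "e3", "e1"]
unique = [eid for eid in delivered if dedup.is_new(eid)]
print(unique)
```

A window like this only catches duplicates that arrive close together. A duplicate that shows up after its ID has been evicted would pass through, so the window size has to cover the longest expected redelivery delay.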

Part 1 — Ingestion and transport (the write path)

Design how a log line gets from an application process to a durable buffer that decouples producers from storage. Cover the on-host agent, batching/back-pressure, the transport/broker tier, partitioning, and the durability guarantee at acknowledgment time.
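
One way to reason about the on-host agent is as a bounded buffer with an explicit policy for what happens when it fills up. The sketch below shows a drop-by-priority policy, which is only one of several options (blocking or spilling to local disk are others). The level names, their ordering, and the `AgentBuffer` class are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical severity ordering; higher values are kept in preference.
LEVEL_PRIORITY = {"DEBUG": 0, "INFO": 1, "WARN": 2, "ERROR": 3, "AUDIT": 4}


@dataclass
class AgentBuffer:
    capacity: int            # max queued events; must be at least 1
    batch_size: int          # max events shipped per batch
    dropped: int = 0         # exported as a metric so loss is never silent
    _queue: deque = field(default_factory=deque)

    def offer(self, event: dict) -> bool:
        """Queue an event. When full, evict the oldest lowest-priority event
        if the new one outranks it; otherwise drop the new event."""
        if len(self._queue) < self.capacity:
            self._queue.append(event)
            return True
        # Linear scan keeps the sketch short; a real agent would index by level.
        victim = min(self._queue, key=lambda e: LEVEL_PRIORITY[e["level"]])
        self.dropped += 1
        if LEVEL_PRIORITY[event["level"]] > LEVEL_PRIORITY[victim["level"]]:
            self._queue.remove(victim)
            self._queue.append(event)
            return True
        return False

    def next_batch(self) -> list:
        """Remove and return up to batch_size events in arrival order."""
        n = min(self.batch_size, len(self._queue))
        return [self._queue.popleft() for _ in range(n)]


# Usage: a two-slot buffer under pressure.
buf = AgentBuffer(capacity=2, batch_size=10)
buf.offer({"level": "DEBUG", "msg": "cache miss"})
buf.offer({"level": "INFO", "msg": "request served"})
buf.offer({"level": "ERROR", "msg": "upstream timeout"})  # evicts the DEBUG event
print([e["level"] for e in buf.next_batch()], "dropped:", buf.dropped)
```

Drops in this sketch happen before anything is acknowledged, so they do not contradict the rule that acknowledged batches must not be lost. They still need a counter so the loss is visible.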

Part 2 — Storage, indexing, and tiering

Design how logs are consumed from the buffer, indexed for fast search, and tiered from hot search storage to cheap long-term archive. Cover the indexing layer, the data model / time partitioning, and the lifecycle from hot to warm to cold.
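
Time partitioning and the storage lifecycle can be illustrated with a small helper that names a daily partition and decides which tier it belongs to by age. This is a sketch under assumptions: the naming scheme, the `audited` flag, and the rule that non-audit data is deleted once it leaves the hot window are one reading of the retention constraint, not something the question prescribes.

```python
from datetime import date

HOT_DAYS = 14        # upper bound of the stated hot retention
ARCHIVE_DAYS = 365   # stated maximum archival retention


def index_name(service: str, day: date) -> str:
    """Hypothetical naming scheme: one partition per service per day."""
    return f"logs-{service}-{day:%Y.%m.%d}"


def tier_for(day: date, today: date, audited: bool) -> str:
    """Return the tier a daily partition belongs to, based on its age."""
    age_days = (today - day).days
    if age_days < HOT_DAYS:
        return "hot"
    if audited and age_days < ARCHIVE_DAYS:
        return "archive"
    return "delete"


# Usage: classify a few partitions relative to a fixed "today".
today = date(2026, 6, 24)
for day, audited in [(date(2026, 6, 20), False),
                     (date(2026, 6, 1), False),
                     (date(2026, 6, 1), True)]:
    print(index_name("payments", day), tier_for(day, today, audited))
```

Because each partition covers a fixed day, moving or dropping it is a whole-partition operation, so retention never requires per-document deletes.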

Clarifying Questions for this Part

  • Do queries need full-text search over message bodies, or is filtering by structured fields (service, level, trace_id, time range) sufficient for most use cases?
  • What fraction of ingested volume must be searchable in the hot tier vs. only retrievable from archive on demand?
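
If structured-field filtering covers most queries, the indexing consumer can index a small curated set of fields and carry everything else as an opaque payload. The sketch below is illustrative only: the `timestamp` and `message` field names and the `raw_extras` key are assumptions about the schema, and `to_index_doc` is a hypothetical helper.

```python
import json

# Curated fields to index; service, level and trace_id appear in the question,
# timestamp is an assumed schema field.
INDEXED_FIELDS = ("timestamp", "service", "level", "trace_id")


def to_index_doc(event: dict) -> dict:
    """Split an event into indexed fields, a full-text message, and an
    unindexed JSON blob holding every remaining field."""
    doc = {k: event[k] for k in INDEXED_FIELDS if k in event}
    doc["message"] = event.get("message", "")
    extras = {k: v for k, v in event.items()
              if k not in INDEXED_FIELDS and k != "message"}
    doc["raw_extras"] = json.dumps(extras, sort_keys=True)  # stored, not indexed
    return doc


# Usage: the request URL is kept but never becomes an indexed field.
event = {
    "timestamp": "2026-06-24T10:00:00Z",
    "service": "payments",
    "level": "ERROR",
    "trace_id": "abc123",
    "message": "charge failed",
    "url": "/v1/charges/98765",
}
print(to_index_doc(event))
```

Keeping high-cardinality fields such as URLs out of the indexed set is also one of the options raised in the first follow-up question.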

Part 3 — Scale, availability, and operations

Show how the design scales horizontally, stays available under failures, prevents one tenant from harming others, and is observable. Cover regional topology, hot-partition/back-pressure handling, multi-tenancy isolation, and how you monitor the pipeline itself.
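
Multi-tenancy isolation is often approached with per-tenant rate limits at the ingest edge. The sketch below uses a token bucket per tenant as one possible mechanism. The class names, rates, and the choice to pass the clock in explicitly (which keeps the example deterministic) are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    rate: float     # tokens (events) refilled per second
    burst: float    # maximum tokens the bucket can hold
    tokens: float   # current balance
    last: float     # timestamp of the previous refill, in seconds

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then try to spend `cost` tokens."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantLimiter:
    """One bucket per tenant, so a noisy tenant only exhausts its own quota."""

    def __init__(self, rate: float, burst: float) -> None:
        self._rate, self._burst = rate, burst
        self._buckets = {}

    def allow(self, tenant: str, now: float) -> bool:
        bucket = self._buckets.get(tenant)
        if bucket is None:
            # New tenants start with a full bucket.
            bucket = TokenBucket(self._rate, self._burst, self._burst, now)
            self._buckets[tenant] = bucket
        return bucket.allow(now)


# Usage: "noisy" bursts past its quota while "quiet" is unaffected.
limiter = TenantLimiter(rate=1.0, burst=2.0)
print([limiter.allow("noisy", 0.0) for _ in range(3)])
print(limiter.allow("quiet", 0.0))
```

Rejected events would feed the same drop counters the pipeline already exports, so throttling shows up per tenant instead of as unexplained missing logs.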

Follow-up Questions

  • A single high-cardinality field (e.g. a unique request URL with embedded IDs) is blowing up your index size and slowing queries. How do you detect this and what are your options (field exclusion, normalization, sampling, dedicated index)?
  • An engineer needs to trace a single request across 30 services. What must be present in the logging schema and the transport path for this to work, and how does it interact with sampling?
  • A region goes fully offline for 2 hours and comes back. Walk through what happens to the logs produced there during the outage and how (or whether) they become searchable afterward. (References Part 1's durability guarantee and Part 3's regional topology.)
  • The business asks to cut logging cost by 40% without losing the ability to debug production incidents. What concrete levers do you pull, and how do you measure that debugging capability is preserved?
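
For the tracing follow-up, one way sampling and cross-service tracing can coexist is to make the sampling decision a deterministic function of the trace ID, so every service keeps or drops the same traces. The sketch below assumes each event carries `trace_id` and `level` fields. The function names and the rule that ERROR and AUDIT events bypass sampling are illustrative choices.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Deterministically keep roughly `sample_rate` of all traces.
    Every host computes the same answer for the same trace_id."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")       # 0 .. 2**64 - 1
    return bucket < int(sample_rate * 2**64)


def should_log(event: dict, sample_rate: float) -> bool:
    """Always keep high-value events; sample the rest by trace."""
    if event["level"] in ("ERROR", "AUDIT"):
        return True
    return keep_trace(event["trace_id"], sample_rate)


# Usage: two services make the same decision for the same trace.
trace = "trace-42"
decision_a = should_log({"level": "INFO", "trace_id": trace}, 0.1)
decision_b = should_log({"level": "DEBUG", "trace_id": trace}, 0.1)
print(decision_a == decision_b)
```

Because the kept set at a lower rate is a subset of the kept set at a higher rate, the rate can be raised during an incident without losing traces that were already being retained.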
