Design a Centralized Logging System
Company: Apple
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Onsite
# Design a Centralized Logging System
Design a system that collects application logs from a large fleet of microservice instances and makes them durable and searchable for debugging and operational monitoring. Engineers should be able to filter a service's logs by time range and run full-text search on the log message, and a log line should become searchable within seconds of being emitted.
The interviewer will deep-dive four areas: the **ingestion pipeline**, the **storage and query design**, and how the system stays **reliable** and **scalable** under load and failure.
### Constraints & Assumptions
State your own numbers; reasonable starting assumptions:
- ~10,000 service instances across many services.
- Aggregate write volume of ~1,000,000 log lines/sec at peak, averaging ~500 bytes/line, so roughly **500 MB/s of raw logs (~43 TB/day if that rate were sustained)**. (Pick numbers and let them drive the design.)
- Logs are mostly write-once, read-rarely; reads are bursty during incidents.
- Retention: a few days to weeks "hot" (fast search), older logs archived cheaply.
- Query patterns: filter by service / host / level, restrict to a time range, and full-text search on the message; target p99 search latency of a few seconds.
- Near-real-time: a log should be queryable within a few seconds of emission.
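
The sizing arithmetic behind these assumptions can be checked with a short script. This is a back-of-envelope sketch only: the 7-day hot retention, 5x compression ratio, and 2 replicas are hypothetical placeholders chosen for illustration, not requirements from the prompt.

```python
# Back-of-envelope sizing from the assumed numbers above.
LINES_PER_SEC = 1_000_000   # peak aggregate write rate (assumption from the list)
BYTES_PER_LINE = 500        # average line size (assumption from the list)
SECONDS_PER_DAY = 86_400


def raw_bytes_per_sec(lines_per_sec: int, bytes_per_line: int) -> int:
    """Raw, uncompressed ingest rate in bytes/sec."""
    return lines_per_sec * bytes_per_line


def raw_tb_per_day(bytes_per_sec: int) -> float:
    """Raw volume per day in decimal terabytes, if the rate were sustained."""
    return bytes_per_sec * SECONDS_PER_DAY / 1e12


def stored_tb(tb_per_day: float, retention_days: int,
              compression_ratio: float, replicas: int) -> float:
    """Disk footprint of a tier after compression and replication."""
    return tb_per_day * retention_days / compression_ratio * replicas


bps = raw_bytes_per_sec(LINES_PER_SEC, BYTES_PER_LINE)
per_day = raw_tb_per_day(bps)
# Hypothetical hot tier: 7 days, 5x compression, 2 replicas.
hot = stored_tb(per_day, retention_days=7, compression_ratio=5.0, replicas=2)

print(f"{bps / 1e6:.0f} MB/s, {per_day:.1f} TB/day raw, {hot:.1f} TB hot tier")
```

Because the 1M lines/sec figure is a peak, the daily total is an upper bound; an average rate, if known, would give a smaller number.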
### Clarifying Questions to Ask
- What is the expected log volume and average line size, and how spiky is it?
- Are logs structured (JSON with fields) or free-form text, or a mix?
- What retention is required, and is there a compliance/PII constraint on storage and access?
- What are the dominant query patterns — full-text search, field filters, metrics/aggregations, or alerting?
- How strict are ordering and delivery guarantees — is at-least-once with possible duplicates acceptable, or is exactly-once required?
- What is the acceptable end-to-end ingestion latency?
### Part 1 — Ingestion pipeline
Design how logs get from each service instance into the system reliably and at high throughput, with minimal impact on the services themselves.
```hint Decouple producers from consumers
Put a durable, partitioned transport log (e.g., Kafka) between the collection agents and the downstream processors so producers never block on slow consumers and traffic spikes are absorbed by the buffer.
```
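
One decision that follows from a partitioned buffer is the partition key. The sketch below assumes a key of `service/host`, so that one instance's lines stay in order within a single partition while load spreads across the fleet. The function name, the partition count of 64, and the use of SHA-256 are illustrative assumptions; this is not Kafka's built-in partitioner.

```python
import hashlib

NUM_PARTITIONS = 64  # hypothetical partition count for the log topic


def partition_for(service: str, host: str,
                  num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a (service, host) pair to a stable partition number.

    All lines from one host land in the same partition, which preserves
    per-host ordering; different hosts spread across partitions.
    """
    key = f"{service}/{host}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


if __name__ == "__main__":
    print(partition_for("checkout", "host-0017"))
```

A trade-off worth mentioning in the interview: keying by service alone would concentrate a noisy service on one partition, while keying by host spreads it but gives up cross-host ordering, which logs rarely need.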
```hint At the edge
Run a lightweight agent/sidecar per host that tails log files, batches and compresses lines, and buffers to local disk so a transient outage downstream does not drop logs or block the app.
```
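
The edge agent's batch, compress, and buffer loop might look like the following minimal sketch. The `BatchingAgent` class, its parameters, and the `send` callback are hypothetical names, and the in-memory `spool` list only stands in for a local-disk spool; a real agent would persist batches to disk and flush on a timer as well as on `add`.

```python
import gzip
import time
from typing import Callable, List


class BatchingAgent:
    """Batches lines, gzips each batch, and spools batches until they are sent."""

    def __init__(self, send: Callable[[bytes], None], max_lines: int = 500,
                 max_age_s: float = 1.0,
                 clock: Callable[[], float] = time.monotonic):
        self._send = send              # raises on failure (assumption)
        self._max_lines = max_lines
        self._max_age_s = max_age_s
        self._clock = clock
        self._buf: List[str] = []
        self._first_at = 0.0
        self.spool: List[bytes] = []   # stand-in for an on-disk spool

    def add(self, line: str) -> None:
        """Buffer one line; flush when the batch is full or too old."""
        if not self._buf:
            self._first_at = self._clock()
        self._buf.append(line)
        too_big = len(self._buf) >= self._max_lines
        too_old = self._clock() - self._first_at >= self._max_age_s
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        """Compress the current batch, spool it, then try to drain the spool."""
        if not self._buf:
            return
        payload = gzip.compress("\n".join(self._buf).encode("utf-8"))
        self._buf = []
        self.spool.append(payload)     # spool first so a failed send loses nothing
        self.retry_spooled()

    def retry_spooled(self) -> int:
        """Send spooled batches oldest-first; stop at the first failure."""
        sent = 0
        while self.spool:
            try:
                self._send(self.spool[0])
            except Exception:
                break                  # downstream still unavailable; keep the batch
            self.spool.pop(0)
            sent += 1
        return sent
```

Spooling before sending means the application never blocks on the network, and an outage only grows the spool, which should itself be capped so the agent cannot fill the host's disk.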
### Part 2 — Storage and query design
Design how logs are stored so they are both cheap to retain and fast to search for the required query patterns.
```hint Two stores, two jobs
Separate the cheap, immutable raw store (object storage like S3/GCS) from a query index. Index only the fields you actually filter/search on rather than indexing everything.
```
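
As a toy illustration of the split, the sketch below keeps every record in a raw store and builds an inverted index over only a few fields plus message tokens. `LogStore`, `INDEXED_FIELDS`, and the whitespace tokenizer are illustrative assumptions; the Python list stands in for object storage and the dict for a real search index.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class LogStore:
    """Raw store holds full records; the index covers only selected fields."""

    INDEXED_FIELDS = ("service", "level")  # hypothetical choice of filter fields

    def __init__(self) -> None:
        self.raw: List[dict] = []          # stand-in for immutable object storage
        self.index: Dict[Tuple[str, str], Set[int]] = defaultdict(set)

    def append(self, record: dict) -> int:
        """Store the full record and index only the chosen fields and tokens."""
        offset = len(self.raw)             # offset acts as the pointer into raw
        self.raw.append(record)
        for field in self.INDEXED_FIELDS:
            if field in record:
                self.index[(field, record[field])].add(offset)
        for token in record.get("message", "").lower().split():
            self.index[("message", token)].add(offset)
        return offset

    def search(self, **terms: str) -> List[dict]:
        """AND together exact field matches and single-token message matches."""
        postings = []
        for field, value in terms.items():
            if field == "message":
                value = value.lower()
            postings.append(self.index.get((field, value), set()))
        if not postings:
            return []
        hits = set.intersection(*postings)
        return [self.raw[o] for o in sorted(hits)]
```

Fields outside `INDEXED_FIELDS` still survive in the raw store, so they can be recovered by a slower scan or a re-index, but they cost nothing in the index.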
```hint Partition by time
Time-partition indices (e.g., per-hour/day, per-service) so old data rolls off cheaply and queries prune to the relevant shards. Use hot/warm/cold tiers to balance cost vs latency.
```
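
Shard pruning by time can be sketched as a pure function from a query's time range to the list of indices it must touch. The `logs-<service>-YYYY.MM.DD.HH` naming scheme and both function names are hypothetical, and the code assumes timezone-aware datetimes.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def index_name(service: str, ts: datetime) -> str:
    """Hourly, per-service index name in UTC (naming scheme is an assumption)."""
    return f"logs-{service}-{ts.astimezone(timezone.utc):%Y.%m.%d.%H}"


def indices_for_range(service: str, start: datetime, end: datetime) -> List[str]:
    """List every hourly index overlapping [start, end]; the rest are pruned."""
    cur = start.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    names = []
    while cur <= end:
        names.append(index_name(service, cur))
        cur += timedelta(hours=1)
    return names


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 23, 30, tzinfo=timezone.utc)
    end = datetime(2024, 1, 2, 1, 5, tzinfo=timezone.utc)
    print(indices_for_range("checkout", start, end))
```

With this layout, retention becomes dropping whole indices older than the cutoff rather than deleting individual documents.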
### Part 3 — Reliability and scalability
Design for no (or bounded) data loss, horizontal scaling of every stage, and graceful behavior under failure and overload.
```hint Durability and dedup
Replicate the buffer (replication factor, producer acks) for durability; with at-least-once delivery, make writes idempotent or dedup on a stable event id so retries do not double-count.
```
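
A best-effort version of dedup on a stable event id is sketched below. It assumes the id is derived from `(host, file path, byte offset)`, which stays the same when the agent retries a batch; `event_id` and `Deduper` are hypothetical names, and the bounded LRU window means duplicates that arrive after eviction would slip through.

```python
import hashlib
from collections import OrderedDict


def event_id(host: str, path: str, offset: int) -> str:
    """Stable id: the same line re-sent after a retry produces the same id."""
    return hashlib.sha256(f"{host}|{path}|{offset}".encode("utf-8")).hexdigest()


class Deduper:
    """Remembers the most recent ids in a bounded LRU window."""

    def __init__(self, capacity: int = 100_000) -> None:
        self._seen: "OrderedDict[str, None]" = OrderedDict()
        self._capacity = capacity

    def first_time(self, eid: str) -> bool:
        """Return True if eid has not been seen within the window."""
        if eid in self._seen:
            self._seen.move_to_end(eid)      # refresh recency
            return False
        self._seen[eid] = None
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)   # evict the oldest id
        return True
```

An alternative that needs no window is to use the event id as the document id in the index, so a retried write overwrites the earlier copy instead of creating a second one.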
```hint Degrade, do not collapse
Under overload, shed load deliberately — sample or drop low-severity logs and apply backpressure — rather than letting the pipeline cascade into failure. Monitor consumer lag as your primary health signal.
```
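
One way to make shedding deliberate is a policy keyed on severity and consumer lag, with sampling decided by hashing the event id so that retries of the same event get the same verdict. The lag thresholds, keep fractions, and function names below are illustrative assumptions, not recommended values.

```python
import hashlib

SEVERITY = {"DEBUG": 0, "INFO": 1, "WARN": 2, "ERROR": 3}


def keep_fraction(level: str, lag_seconds: float) -> float:
    """Fraction of events to keep for a level at the current consumer lag."""
    if SEVERITY.get(level, 3) >= 2:
        return 1.0                 # never shed WARN/ERROR; unknown levels are kept
    if lag_seconds < 30:           # healthy: keep everything
        return 1.0
    if lag_seconds < 120:          # degraded: sample INFO, drop DEBUG
        return 0.1 if level == "INFO" else 0.0
    return 0.01 if level == "INFO" else 0.0   # overloaded: keep 1% of INFO


def should_keep(level: str, event_id: str, lag_seconds: float) -> bool:
    """Deterministic sampling: the same event id always gets the same decision."""
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < keep_fraction(level, lag_seconds) * 10_000
```

Whatever the policy, the number of shed events per service should itself be recorded as a metric, so engineers know when search results are incomplete.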
### Follow-up Questions
- Is exactly-once worth the cost for logs, or is at-least-once plus best-effort dedup sufficient? How would you implement dedup?
- How do you handle multi-line entries (stack traces) and a mix of structured JSON and plain text during parsing?
- How would you support near-real-time alerting on log patterns (e.g., a sudden spike in error rate)?
- How would you run this across multiple regions — ingest locally but query globally?
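
For the multi-line question, one common approach is to treat any line that does not begin with a timestamp as a continuation of the previous entry. The sketch below assumes entries start with an ISO-like timestamp (`YYYY-MM-DD HH:MM:SS` or with a `T` separator); the `START` pattern and `group_entries` name are hypothetical and would need adapting to each service's actual format.

```python
import re
from typing import Iterable, Iterator, List

# Assumption: a new log entry begins with an ISO-like timestamp.
START = re.compile(r"^\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}")


def group_entries(lines: Iterable[str]) -> Iterator[str]:
    """Join continuation lines (e.g. stack-trace frames) onto their entry."""
    current: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if START.match(line) and current:
            yield "\n".join(current)   # previous entry is complete
            current = []
        current.append(line)
    if current:
        yield "\n".join(current)       # flush the final entry


if __name__ == "__main__":
    sample = [
        "2024-01-01 10:00:00 ERROR boom\n",
        "Traceback (most recent call last):\n",
        '  File "app.py", line 1, in <module>\n',
        "2024-01-01 10:00:01 INFO recovered\n",
    ]
    for entry in group_entries(sample):
        print(repr(entry))
```

Doing this grouping in the edge agent, before batching, keeps a stack trace in one event so it is indexed and deduplicated as a unit.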
Quick Answer: This question evaluates a candidate's ability to design a large-scale, distributed logging system covering ingestion, storage, indexing, and query serving. It tests system design fundamentals such as decoupling producers from consumers, partitioning and tiered storage, and reliability trade-offs under high write throughput. This type of prompt is common in system design interviews to assess practical, application-level architectural reasoning rather than purely conceptual knowledge.