Design a log filtering and analytics service
Company: Amazon
Role: Software Engineer
Category: System Design
Difficulty: hard
Interview Round: Technical Screen
##### Question
Design a log-processing service that ingests application logs at scale and supports the following capabilities:
1. **Filter logs by attributes** — e.g., service/component, level, host, and a substring or regex pattern on the message, scoped to a time range. Expose this as `filter(query)`.
2. **Count error-level logs over a time window** — return the number of ERROR (or higher) logs over a specified window, with optional predicates. Expose this as `countErrors(window)`.
3. **Build an hourly histogram** for a specific log pattern, message predicate, or log ID over a window — returning a count per hour bucket. Expose this as `histogramByHour(query, window)`.
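To pin down what the three calls are expected to return, here is a minimal in-memory reference sketch. It is not a scalable design: it scans every event on each call. The names `LogEvent`, `Query`, `filter_logs`, `count_errors`, and `histogram_by_hour`, the field set, and the level ordering are all assumptions made for illustration, not part of the question.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Optional

HOUR = 3600
# Assumed severity ordering; "ERROR or higher" means rank >= LEVELS["ERROR"].
LEVELS = {"DEBUG": 10, "INFO": 20, "WARN": 30, "ERROR": 40, "FATAL": 50}

@dataclass(frozen=True)
class LogEvent:
    log_id: str
    ts: int        # event time, epoch seconds (UTC)
    service: str
    host: str
    level: str
    message: str

@dataclass(frozen=True)
class Query:
    start: int                       # inclusive, epoch seconds
    end: int                         # exclusive, epoch seconds
    service: Optional[str] = None
    host: Optional[str] = None
    min_level: Optional[str] = None
    pattern: Optional[str] = None    # regex searched within the message
    log_id: Optional[str] = None

def matches(e: LogEvent, q: Query) -> bool:
    """Return True if the event satisfies every predicate set on the query."""
    if not (q.start <= e.ts < q.end):
        return False
    if q.service is not None and e.service != q.service:
        return False
    if q.host is not None and e.host != q.host:
        return False
    if q.log_id is not None and e.log_id != q.log_id:
        return False
    if q.min_level is not None and LEVELS[e.level] < LEVELS[q.min_level]:
        return False
    if q.pattern is not None and re.search(q.pattern, e.message) is None:
        return False
    return True

def filter_logs(events, q: Query):
    """filter(query): all matching events, in input order."""
    return [e for e in events if matches(e, q)]

def count_errors(events, start: int, end: int, **predicates) -> int:
    """countErrors(window): ERROR-or-higher events in [start, end)."""
    q = Query(start=start, end=end, min_level="ERROR", **predicates)
    return sum(1 for e in events if matches(e, q))

def histogram_by_hour(events, q: Query) -> dict:
    """histogramByHour(query, window): hour-bucket start -> match count."""
    buckets = Counter()
    for e in events:
        if matches(e, q):
            buckets[e.ts - e.ts % HOUR] += 1
    return dict(sorted(buckets.items()))
```

A real design would replace the linear scan with time-partitioned storage and indexes, but its results should agree with this reference on the same input.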
In your design, specify:
- The **ingestion API and flow** (transport, batching, validation, enrichment, idempotency).
- **Storage and indexing choices** — time-series/OLAP partitioning, inverted indexes for substring/regex, and storage tiering (hot/warm/cold).
- The **query API** and how each call (`filter`, `countErrors`, `histogramByHour`) is planned and routed.
- **Schema design** with example fields.
- **Handling of late, duplicated, and out-of-order events** (watermarks, deduplication).
- **Aggregation strategies** (on-write rollups vs on-read aggregation, caching).
- **Scalability, partitioning, and retention**.
- **Correctness vs latency / performance trade-offs**.
- **Complexity analysis** for the common queries, with both big-O and practical latency estimates.
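As one possible illustration of how deduplication, watermarks, and on-write rollups fit together, the sketch below keeps per-hour error counts in memory. Everything in it is an assumption made for the example: the class name `HourlyErrorRollup`, the `allowed_lateness` parameter, the unbounded `seen_ids` set (a real system would expire IDs or use a probabilistic filter), and the choice to divert events behind the watermark to a separate `late` list for backfill.

```python
from collections import defaultdict

HOUR = 3600
ERROR_LEVELS = ("ERROR", "FATAL")  # assumed meaning of "ERROR or higher"

class HourlyErrorRollup:
    """On-write rollup of error counts keyed by (service, hour bucket)."""

    def __init__(self, allowed_lateness: int = 600):
        self.allowed_lateness = allowed_lateness  # seconds
        self.max_event_ts = 0            # highest accepted event time so far
        self.seen_ids = set()            # idempotency keys (unbounded here)
        self.counts = defaultdict(int)   # (service, hour_start) -> count
        self.late = []                   # events behind the watermark

    @property
    def watermark(self) -> int:
        # Events older than this are treated as too late for the hot path.
        return self.max_event_ts - self.allowed_lateness

    def ingest(self, log_id: str, ts: int, service: str, level: str) -> str:
        if log_id in self.seen_ids:
            return "duplicate"           # retry or replay: drop silently
        self.seen_ids.add(log_id)
        if ts < self.watermark:
            self.late.append((log_id, ts, service, level))
            return "late"                # left for a backfill job
        self.max_event_ts = max(self.max_event_ts, ts)
        if level in ERROR_LEVELS:
            self.counts[(service, ts - ts % HOUR)] += 1
        return "accepted"

    def count_errors(self, service: str, start: int, end: int) -> int:
        """Sum rollup buckets for hour-aligned [start, end)."""
        return sum(self.counts.get((service, h), 0)
                   for h in range(start, end, HOUR))
```

With rollups like this, a `countErrors` call over a W-hour window reads about W buckets instead of scanning raw events. The cost is that counts stay approximate until late events are backfilled, which is the correctness versus latency trade-off the question asks about.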
Quick Answer: An Amazon software engineer system design question: design a log filtering and analytics service that ingests high-volume application logs and supports attribute/substring filtering, error counts over a time window, and hourly histograms by pattern or ID. It tests ingestion API design, dual search + OLAP storage with hourly rollups, schema and indexing, late/duplicate-event handling, partitioning and retention, and correctness-vs-latency trade-offs with complexity analysis.