# System Design: Ad Clickstream Ingestion and Analytics with Kafka, S3, and Presto
## Context
You are asked to design an end-to-end advertisement clickstream platform that ingests events from web and mobile clients, persists raw data durably, and supports interactive analytics. The stack must use Kafka for ingestion, S3 for raw storage, and Presto for queries. The system should scale to heavy traffic, tolerate component failures, and balance real-time and batch needs.
Assume the following to make the problem concrete:
- Scale: 5–10 billion events/day, ~1 KB average payload, peak of 150k events/sec (with bursts up to 2× for several minutes); see the sizing sketch after this list.
- Latency SLOs:
  - Real-time monitoring: fresh data available for exploratory queries within 2–5 minutes.
  - Batch analytics: hourly/daily rollups with strong correctness.
- Queries: session funnels, CTR, geo/device breakdowns, campaign attribution, and ad-hoc exploration.
- Availability: multi-AZ; data durability across region-level incidents is a plus.
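
To ground these numbers, here is a back-of-envelope sizing sketch. The ~10 MB/s-per-partition figure is a hypothetical planning rule of thumb, not a Kafka guarantee:

```python
# Derived from the assumptions above; all results are approximate.
EVENTS_PER_DAY = 10e9        # upper bound of 5-10 billion/day
AVG_PAYLOAD_BYTES = 1_000    # ~1 KB average event
PEAK_EPS = 150_000           # peak events/sec
BURST_FACTOR = 2             # short bursts at 2x peak

raw_tb_per_day = EVENTS_PER_DAY * AVG_PAYLOAD_BYTES / 1e12
peak_mb_s = PEAK_EPS * AVG_PAYLOAD_BYTES / 1e6
burst_mb_s = peak_mb_s * BURST_FACTOR

print(f"raw ingest: ~{raw_tb_per_day:.0f} TB/day before compression")  # ~10
print(f"peak: ~{peak_mb_s:.0f} MB/s, burst: ~{burst_mb_s:.0f} MB/s")   # 150/300

# Hypothetical planning figure: assume one partition absorbs ~10 MB/s of
# producer traffic with headroom, and size the topic for the burst.
PARTITION_MB_S = 10
print(f"topic partitions: >= {int(burst_mb_s / PARTITION_MB_S)}")      # >= 30
```

At ~10 TB/day raw, columnar compression and lifecycle policies on S3 matter as much as ingest throughput.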
## Requirements
Design the platform and cover:
- End-to-end architecture and data flow from producers to analytics (a producer sketch follows this list).
- Storage layout and file formats in S3 for efficient Presto queries (see the layout sketch below).
- How Presto is configured to query the lake (metastore/catalogs, partitioning, schema evolution).
- Scaling strategies for Kafka, streaming/batch jobs, S3 layout, and Presto.
- Failover and recovery mechanisms across components.
- Data quality, schema management, and deduplication (see the dedup sketch below).
- Trade-offs between real-time and batch processing (freshness, cost, correctness).
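
At the ingestion edge, one plausible producer setup is an idempotent, compressed Kafka producer keyed by session so a session's events land in one partition. A minimal sketch using the confluent-kafka client; the broker list, topic name, and event fields are assumptions:

```python
import json
from confluent_kafka import Producer  # pip install confluent-kafka

producer = Producer({
    "bootstrap.servers": "kafka-1:9092,kafka-2:9092",  # placeholder brokers
    "enable.idempotence": True,   # broker dedups producer retries
    "acks": "all",                # wait for in-sync replicas (durability)
    "compression.type": "lz4",    # ~1 KB JSON events compress well
    "linger.ms": 20,              # small batching delay for throughput
})

def delivery_report(err, msg):
    # In production, spool failed events locally or to a dead-letter path.
    if err is not None:
        print(f"delivery failed: {err}")

def publish_click(event: dict) -> None:
    # Keying by session_id keeps each session's events in one partition,
    # preserving per-session ordering for funnel analysis.
    producer.produce(
        "ad-clicks",                       # hypothetical topic name
        key=event["session_id"].encode(),
        value=json.dumps(event).encode(),
        on_delivery=delivery_report,
    )
    producer.poll(0)  # serve delivery callbacks without blocking
```

The idempotent/acks-all combination trades a little produce latency for no duplicates on retry, which simplifies everything downstream.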
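For the S3 layout, a common pattern is Hive-style, hour-partitioned Parquet with files in the ~128–512 MB range so Presto can prune partitions and avoid small-file overhead. A sketch of one possible table registered through the Hive metastore; the bucket name, schema, and column set are assumptions, and the client shown is presto-python-client:

```python
import prestodb  # pip install presto-python-client

# Illustrative object layout the ingestion jobs would write:
#   s3://ads-clickstream/raw/dt=2024-06-01/hr=13/part-00000.snappy.parquet

DDL = """
CREATE TABLE IF NOT EXISTS hive.ads.clicks_raw (
    event_id    varchar,
    session_id  varchar,
    campaign_id varchar,
    device      varchar,
    geo         varchar,
    event_time  timestamp,
    dt          varchar,  -- partition key: date, e.g. '2024-06-01'
    hr          varchar   -- partition key: hour of day, e.g. '13'
)
WITH (
    external_location = 's3://ads-clickstream/raw/',
    format = 'PARQUET',
    partitioned_by = ARRAY['dt', 'hr']
)
"""

conn = prestodb.dbapi.connect(
    host="presto-coordinator",  # placeholder coordinator host
    port=8080,
    user="etl",
    catalog="hive",
    schema="ads",
)
cur = conn.cursor()
cur.execute(DDL)
cur.fetchall()  # the client executes lazily; fetching drains the result

# Partitions written later by the pipeline can be registered with, e.g.,
#   CALL system.sync_partition_metadata('ads', 'clicks_raw', 'ADD')
# (available in recent Presto Hive connectors).
```

Appending nullable columns over time is the safest evolution path, and matching Parquet columns by name in the Hive connector (hive.parquet.use-column-names) helps keep it that way.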
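Producers retry, so duplicates are expected even with idempotent producers once consumers reprocess. A common split is best-effort streaming dedup by event_id within a short window, with exact dedup deferred to the batch layer. A minimal in-process sketch; the event_id field, TTL, and capacity are assumptions:

```python
import time
from collections import OrderedDict
from typing import Optional

class WindowedDeduper:
    """Drop events whose event_id was seen within the last ttl_s seconds.
    In a real deployment this state would live in the stream processor's
    keyed state store or an external store, not process memory."""

    def __init__(self, ttl_s: int = 600, max_entries: int = 10_000_000):
        self.ttl_s = ttl_s
        self.max_entries = max_entries
        self._seen: OrderedDict = OrderedDict()  # event_id -> last-seen time

    def is_duplicate(self, event_id: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        # Evict expired entries (and cap memory) from the oldest end.
        while self._seen:
            _, ts = next(iter(self._seen.items()))
            if now - ts > self.ttl_s or len(self._seen) > self.max_entries:
                self._seen.popitem(last=False)
            else:
                break
        if event_id in self._seen:
            return True
        self._seen[event_id] = now
        return False

if __name__ == "__main__":
    d = WindowedDeduper(ttl_s=300)
    print(d.is_duplicate("evt-1"))  # False: first sighting
    print(d.is_duplicate("evt-1"))  # True: retry within the window
```

Exact correctness then falls to the batch layer: the hourly compaction job can keep one row per event_id per partition (e.g., a row_number() window function) before publishing rollups.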