Design an ad clickstream analytics pipeline
Company: Amazon
Role: Software Engineer
Category: System Design
Difficulty: hard
Interview Round: Onsite
##### Question
Design an end-to-end advertising clickstream ingestion and analytics platform that ingests events through Kafka, stores raw and curated data in S3, and supports interactive queries with Presto. Cover the following:
1. **Ingestion (Kafka):** Define the topic partitioning strategy, message schema and serialization (with a schema registry), partition keys, ordering and delivery semantics, and consumer groups for multiple use cases (e.g., real-time CTR metrics within one minute *and* batch ETL to S3).
2. **Storage (S3 data lake):** Define the S3 layout and partitioning strategy for efficient Presto queries (e.g., by date / hour / campaign), the raw vs. curated zone separation, catalog and schema management, file formats, and compaction.
3. **Query engine (Presto):** Describe catalog/metastore integration, partition pruning, summary/materialized tables, and query optimizations.
4. **Scale & resilience:** Scale the system to 1M+ events/second. Address backpressure handling, broker outages, consumer restarts, exactly-once vs. at-least-once trade-offs, failover and recovery, data reprocessing/backfills, and late/out-of-order events.
5. **Governance & cost:** Cover schema evolution, PII governance, and cost optimization.
6. **Trade-offs:** Compare real-time vs. batch processing and explain where each is appropriate.
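
For the 1M+ events/second target in item 4, a back-of-envelope partition count is a reasonable way to open the discussion. The sketch below is illustrative only: the 500-byte average event size, the 10 MB/s sustainable write rate per partition, and the 2x headroom factor are all assumptions, not figures from the question, and the helper name `partitions_needed` is hypothetical. Real numbers should come from measuring the actual cluster.

```python
import math


def partitions_needed(events_per_sec, avg_event_bytes,
                      per_partition_bytes_per_sec, headroom=2.0):
    """Estimate the Kafka partition count for a topic.

    All inputs are assumptions supplied by the caller; nothing here is
    measured. `headroom` over-provisions for traffic spikes and for
    consumers catching up after a restart.
    """
    ingress_bytes_per_sec = events_per_sec * avg_event_bytes
    return math.ceil(ingress_bytes_per_sec * headroom / per_partition_bytes_per_sec)


if __name__ == "__main__":
    # Assumed: 1M events/s, 500 B per event, 10 MB/s per partition, 2x headroom.
    print(partitions_needed(1_000_000, 500, 10_000_000))
```

Under these assumed inputs the ingress is about 500 MB/s, so the estimate is on the order of a hundred partitions. The same arithmetic applies to consumer parallelism, since a consumer group cannot have more active consumers than the topic has partitions.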
Quick Answer: This Amazon system design question asks the candidate to design an end-to-end ad clickstream analytics platform: Kafka ingestion, an S3 data lake with raw and curated zones, and interactive Presto queries. It evaluates partitioning and serialization, real-time CTR and batch ETL, scaling to 1M+ events/second, exactly-once vs. at-least-once trade-offs, failure recovery, schema evolution, PII governance, and cost optimization.
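
To make the S3 layout part of the question concrete, here is a minimal sketch of a Hive-style partition prefix derived from event time, which is the layout Presto's Hive connector can prune on. The zone and table names (`curated`, `clicks`), the `dt=`/`hour=` partition columns, and the helper name `partition_prefix` are assumptions chosen for illustration. Whether to add a campaign-level partition as well depends on campaign cardinality, since a very high partition count tends to produce many small files.

```python
from datetime import datetime, timezone


def partition_prefix(zone, table, event_time_ms):
    """Build a Hive-style S3 key prefix from an event timestamp.

    Uses event time (epoch milliseconds, interpreted as UTC) rather than
    arrival time, so a late event lands in the partition it belongs to.
    Zone and table names are hypothetical.
    """
    ts = datetime.fromtimestamp(event_time_ms / 1000, tz=timezone.utc)
    return f"{zone}/{table}/dt={ts:%Y-%m-%d}/hour={ts:%H}/"


if __name__ == "__main__":
    # 1705323600000 ms is 2024-01-15 13:00:00 UTC.
    print(partition_prefix("curated", "clicks", 1705323600000))
```

A query that filters on `dt` and `hour` then only needs to read objects under the matching prefixes, which is the partition pruning the question asks the candidate to describe.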