Design an End-to-End Advertising Clickstream Analytics System
Context
You are designing a large-scale clickstream platform for an ad system. The platform must ingest impression and click events via Kafka, land both raw and curated data in S3, and support interactive analytics with Presto. It must deliver real-time campaign metrics (e.g., CTR within 1 minute) and support batch ETL to S3 for ad-hoc exploration. Target scale is 1M+ events per second.
Requirements
- Kafka Ingestion
  - Topic partitioning strategy (keys, partition count, skew mitigation).
  - Message schema and serialization with schema management.
  - Ordering and delivery semantics (idempotence, transactions, acks).
  - Consumer groups for multiple use cases: real-time metrics (CTR within 1 minute) and batch ETL to S3 (see the producer and consumer-group sketches after this list).
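A minimal producer sketch, assuming the confluent-kafka Python client and hypothetical names (topic ad_events, fields campaign_id and user_id): keying by campaign_id gives per-campaign ordering on a single partition, the salted key in the comment is one way to spread a hot campaign, and enable.idempotence plus acks=all trades a little latency for duplicate-free, durable writes.

```python
import json

from confluent_kafka import Producer

# Hypothetical brokers and topic; tune partition count, linger, and batch
# size against the 1M+ events/sec target on real hardware.
producer = Producer({
    "bootstrap.servers": "broker1:9092,broker2:9092",
    "enable.idempotence": True,   # retries cannot introduce duplicates
    "acks": "all",                # wait for in-sync replicas before acking
    "compression.type": "lz4",    # cheap CPU for large network savings
    "linger.ms": 10,              # small batching delay to raise throughput
})

def send_event(event: dict) -> None:
    # Keying by campaign_id keeps one campaign's events ordered on a single
    # partition; for a hot campaign, salt the key to spread the load, e.g.
    # key = f"{event['campaign_id']}#{hash(event['user_id']) % 8}"
    key = str(event["campaign_id"])
    producer.produce(
        topic="ad_events",
        key=key,
        value=json.dumps(event).encode("utf-8"),
        on_delivery=lambda err, msg: err and print(f"delivery failed: {err}"),
    )
    producer.poll(0)  # serve delivery callbacks without blocking
```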
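Consumer groups then isolate the two read paths over the same topic: each group tracks its own offsets, so a backlogged S3 ETL never delays the metrics path. A sketch under the same assumptions:

```python
from confluent_kafka import Consumer

# Two independent groups read the same topic at their own pace; offsets are
# tracked per group, so a slow ETL consumer never holds back the metrics path.
def make_consumer(group_id: str) -> Consumer:
    return Consumer({
        "bootstrap.servers": "broker1:9092",
        "group.id": group_id,
        "auto.offset.reset": "earliest",
        "enable.auto.commit": False,  # commit only after work is durably done
    })

realtime = make_consumer("ctr-aggregator")   # low latency, small batches
batch_etl = make_consumer("s3-etl")          # high throughput, large batches
realtime.subscribe(["ad_events"])
batch_etl.subscribe(["ad_events"])
```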
- S3 Data Lake
  - Raw and curated zones; layout and partitioning for efficient Presto queries (e.g., by date/hour/campaign).
  - Catalog and schema management for Presto.
  - File format and compaction strategy to avoid small files (see the layout and catalog sketches after this list).
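For the curated zone, one concrete layout is Hive-style partition directories written as Parquet; a sketch assuming pyarrow (which resolves s3:// paths via its built-in S3 filesystem) and a hypothetical ads-lake bucket. The dt/hr/campaign_id directories line up with Presto partition pruning, and accumulating large files per partition (a few hundred MB is a common target) is the first defense against small files:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical curated-zone layout:
#   s3://ads-lake/curated/ad_events/dt=2024-05-01/hr=13/campaign_id=42/part-*.parquet
events = pa.table({
    "event_id": ["e1", "e2"],
    "event_type": ["impression", "click"],
    "campaign_id": [42, 42],
    "dt": ["2024-05-01", "2024-05-01"],
    "hr": [13, 13],
})

# write_to_dataset creates Hive-style dt=/hr=/campaign_id= directories that
# Presto can prune; accumulate large row groups per file to avoid small files.
pq.write_to_dataset(
    events,
    root_path="s3://ads-lake/curated/ad_events",
    partition_cols=["dt", "hr", "campaign_id"],
    compression="zstd",
)
```

A periodic compaction job can then rewrite a closed partition's small files into a few large ones in place.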
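Registering the layout in the catalog makes it queryable; a sketch using the trino Python client against a Hive-metastore-backed catalog (all names hypothetical; the prestodb client is analogous):

```python
import trino

conn = trino.dbapi.connect(host="presto.internal", port=8080,
                           user="etl", catalog="hive", schema="ads")
cur = conn.cursor()

# External table over the curated zone; partition columns mirror the
# dt=/hr=/campaign_id= directory layout so queries prune by date/hour/campaign.
cur.execute("""
    CREATE TABLE IF NOT EXISTS ad_events (
        event_id    VARCHAR,
        event_type  VARCHAR,
        user_token  VARCHAR,
        dt          VARCHAR,
        hr          INTEGER,
        campaign_id BIGINT
    )
    WITH (
        external_location = 's3://ads-lake/curated/ad_events',
        format = 'PARQUET',
        partitioned_by = ARRAY['dt', 'hr', 'campaign_id']
    )
""")
```

Newly landed partitions still need registration with the metastore, e.g. via the Hive connector's sync_partition_metadata procedure.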
- Analytics and Processing
  - Real-time pipeline for CTR and other dashboards (a windowed-aggregation sketch follows this list).
  - Batch ETL to curated S3 tables for interactive Presto queries.
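Production real-time pipelines usually run on a streaming engine (Flink, Spark Structured Streaming), but the core of the CTR path is a tumbling-window count; a hand-rolled sketch, assuming JSON events with event_type, campaign_id, and a millisecond ts field:

```python
import json
from collections import defaultdict

from confluent_kafka import Consumer

consumer = Consumer({"bootstrap.servers": "broker1:9092",
                     "group.id": "ctr-aggregator",
                     "auto.offset.reset": "latest"})
consumer.subscribe(["ad_events"])

# (campaign_id, minute_bucket) -> [impressions, clicks]
windows = defaultdict(lambda: [0, 0])

def ctr(campaign_id: int, bucket: int) -> float:
    impressions, clicks = windows[(campaign_id, bucket)]
    return clicks / impressions if impressions else 0.0

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    bucket = event["ts"] // 60_000          # 1-minute tumbling window
    slot = windows[(event["campaign_id"], bucket)]
    if event["event_type"] == "impression":
        slot[0] += 1
    elif event["event_type"] == "click":
        slot[1] += 1
    # When a window's minute has passed, emit ctr(...) to the dashboard store
    # and evict it; a real job also needs watermarks for late events and
    # checkpointed state to survive restarts.
```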
- Reliability and Scale
  - Scaling to 1M+ events/second; backpressure handling.
  - Exactly-once vs. at-least-once trade-offs and configurations (see the transactional sketch after this list).
  - Failure scenarios and recovery (broker outages, consumer restarts).
  - Data reprocessing and schema evolution (a timestamp-replay sketch also follows).
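On the exactly-once side of the trade-off, Kafka transactions tie offset commits to output writes atomically; at-least-once with an idempotent sink is usually cheaper and often sufficient. A consume-transform-produce sketch with confluent-kafka (topic names hypothetical; real jobs batch many messages per transaction rather than one):

```python
from confluent_kafka import Consumer, Producer, TopicPartition

consumer = Consumer({"bootstrap.servers": "broker1:9092",
                     "group.id": "ctr-eos",
                     "enable.auto.commit": False,
                     "isolation.level": "read_committed"})
consumer.subscribe(["ad_events"])

producer = Producer({"bootstrap.servers": "broker1:9092",
                     "transactional.id": "ctr-eos-1"})  # stable per process
producer.init_transactions()

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    producer.begin_transaction()
    producer.produce("ctr_minute", value=msg.value())  # transform elided
    # Offsets commit atomically with the produced records: both become
    # visible together, or neither does (e.g., after a crash mid-batch).
    producer.send_offsets_to_transaction(
        [TopicPartition(msg.topic(), msg.partition(), msg.offset() + 1)],
        consumer.consumer_group_metadata(),
    )
    producer.commit_transaction()
```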
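Reprocessing within Kafka's retention window is mostly an offsets problem: a fresh consumer group can seek every partition to a timestamp and replay from there, as in this sketch; beyond retention, replay from the raw S3 zone instead.

```python
from confluent_kafka import Consumer

consumer = Consumer({"bootstrap.servers": "broker1:9092",
                     "group.id": "s3-etl-replay",  # fresh group for the replay
                     "enable.auto.commit": False})

def on_assign(c: Consumer, partitions: list) -> None:
    replay_from_ms = 1714500000000  # hypothetical: replay since this timestamp
    for p in partitions:
        p.offset = replay_from_ms   # offsets_for_times reads this as a timestamp
    # Map each timestamp to the earliest offset at/after it, per partition.
    c.assign(c.offsets_for_times(partitions))

consumer.subscribe(["ad_events"], on_assign=on_assign)
```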
- Governance and Cost
  - PII governance (separation, encryption, access control, deletion); a pseudonymization sketch follows this list.
  - Cost optimization strategies.
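One common PII pattern, sketched below under assumed names: raw identifiers stay in a restricted zone, and only keyed-hash tokens land in the curated zone. Because the token is a deterministic HMAC, joins and frequency analyses still work; destroying or rotating the key makes the tokens unlinkable, which supports deletion requests without rewriting history:

```python
import hashlib
import hmac

# The key lives in a KMS/secret manager and never lands in the lake;
# destroying it renders every token derived from it unlinkable
# ("crypto-shredding").
PII_KEY = b"fetch-from-kms-not-hardcoded"

def pseudonymize(user_id: str) -> str:
    # Deterministic keyed hash: the same user always yields the same token.
    return hmac.new(PII_KEY, user_id.encode("utf-8"), hashlib.sha256).hexdigest()

event = {"user_id": "u-12345", "campaign_id": 42, "event_type": "click"}
user_id = event.pop("user_id")  # strip raw PII before landing curated data
curated = {**event, "user_token": pseudonymize(user_id)}
# curated carries user_token only; the raw user_id stays in the restricted zone.
```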
- Trade-offs
  - Compare real-time vs. batch processing and when each is appropriate.