Design ad clickstream analytics pipeline
Company: Amazon
Role: Software Engineer
Category: System Design
Difficulty: Hard
Interview Round: Onsite
Design an end-to-end advertising clickstream system that ingests events via Kafka, stores raw and curated data in S3, and supports interactive analytics with Presto. Specify topic partitioning strategy, message schema and serialization, ordering and delivery semantics, and consumer groups for multiple use cases (e.g., real-time metrics like CTR within one minute and batch ETL to S3).
Define S3 layout and partitioning strategy for efficient Presto queries (e.g., by date/hour/campaign), catalog and schema management, and compaction. Address scaling to 1M+ events/second, backpressure handling, exactly-once vs at-least-once trade-offs, failure scenarios and recovery (broker outages, consumer restarts), data reprocessing, schema evolution, PII governance, and cost optimization. Compare real-time vs batch processing trade-offs and where each is appropriate.
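One way to make the partitioning part of the prompt concrete is a small sketch of how an event could map to a Kafka partition and to an S3 key prefix. Everything named here is an assumption for illustration, not part of the question: the `curated/clicks` prefix, the `dt=`/`hour=`/`campaign_id=` Hive-style directory names, and the choice of a campaign-based message key. The CRC32 hash is only a stand-in that shows the idea of stable key-to-partition mapping; Kafka's default Java partitioner hashes key bytes with murmur2.

```python
from datetime import datetime, timezone
import zlib


def s3_key_prefix(event_ts_ms: int, campaign_id: str, layer: str = "curated") -> str:
    """Build a Hive-style partition prefix (date/hour/campaign) from event time.

    The layout is a hypothetical example; partition names are assumptions.
    """
    ts = datetime.fromtimestamp(event_ts_ms / 1000, tz=timezone.utc)
    return f"{layer}/clicks/dt={ts:%Y-%m-%d}/hour={ts:%H}/campaign_id={campaign_id}/"


def kafka_partition(key: str, num_partitions: int) -> int:
    """Map a message key to a partition with a stable hash.

    CRC32 is an illustrative stand-in, not Kafka's actual partitioner.
    Events sharing a key land in the same partition, which preserves
    per-key ordering.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


if __name__ == "__main__":
    print(s3_key_prefix(1_700_000_000_000, "c42"))
    print(kafka_partition("c42", 64))
```

A trade-off worth raising with this layout: partitioning S3 by campaign keeps campaign-filtered Presto queries cheap through partition pruning, but a large number of campaigns can produce many small files per hour, which is one reason the prompt asks about compaction.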
Quick Answer: This question evaluates a candidate's ability to design large-scale streaming and data lake architectures, testing competencies in Kafka-based ingestion, S3 data layout and partitioning, real-time and batch analytics (CTR and dashboards), reliability and failure recovery, schema management, governance, and cost trade-offs.
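The real-time CTR requirement can be sketched as a one-minute tumbling-window aggregation with deduplication by event ID, which is a common way to get effectively-once results on top of at-least-once delivery. This is a minimal in-memory sketch under stated assumptions: the field names `event_id`, `campaign_id`, `type`, and `ts_ms` are hypothetical, and a production job would keep this state in a stream processor with bounded, expiring state rather than in a Python set.

```python
from collections import defaultdict

WINDOW_MS = 60_000  # one-minute tumbling windows


def ctr_by_minute(events):
    """Compute CTR per (campaign_id, window_start_ms).

    Duplicate deliveries are dropped by event_id. Returns None for a window
    that has clicks but no impressions.
    """
    counts = defaultdict(lambda: {"impression": 0, "click": 0})
    seen = set()
    for e in events:
        if e["event_id"] in seen:
            continue  # redelivered event under at-least-once semantics
        seen.add(e["event_id"])
        window_start = e["ts_ms"] - e["ts_ms"] % WINDOW_MS
        counts[(e["campaign_id"], window_start)][e["type"]] += 1
    return {
        key: (c["click"] / c["impression"] if c["impression"] else None)
        for key, c in counts.items()
    }


if __name__ == "__main__":
    sample = [
        {"event_id": "a", "campaign_id": "c1", "type": "impression", "ts_ms": 1_000},
        {"event_id": "b", "campaign_id": "c1", "type": "impression", "ts_ms": 2_000},
        {"event_id": "c", "campaign_id": "c1", "type": "click", "ts_ms": 3_000},
        {"event_id": "c", "campaign_id": "c1", "type": "click", "ts_ms": 3_000},
    ]
    print(ctr_by_minute(sample))
```

Windowing on event time, as done here, keeps results reproducible during reprocessing; the cost is that late events need a policy (for example, an allowed-lateness bound) before a window's CTR can be treated as final.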