PracHub

Design ad clickstream analytics pipeline

Last updated: Jun 15, 2026

Quick Overview

This Amazon system design question asks the candidate to design an end-to-end ad clickstream analytics platform: Kafka ingestion, an S3 data lake with raw and curated zones, and interactive Presto queries. It evaluates partitioning and serialization, real-time CTR and batch ETL, scaling to 1M+ events/second, exactly-once vs at-least-once trade-offs, failure recovery, schema evolution, PII governance, and cost optimization.



Company: Amazon

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Onsite

Question

Design an end-to-end advertising clickstream ingestion and analytics platform that ingests events through Kafka, stores raw and curated data in S3, and supports interactive queries with Presto. Cover the following:

  1. Ingestion (Kafka): Define the topic partitioning strategy, message schema and serialization (with a schema registry), partition keys, ordering and delivery semantics, and consumer groups for multiple use cases (e.g., real-time CTR metrics within one minute and batch ETL to S3).
  2. Storage (S3 data lake): Define the S3 layout and partitioning strategy for efficient Presto queries (e.g., by date / hour / campaign), the raw vs. curated zone separation, catalog and schema management, file formats, and compaction.
  3. Query engine (Presto): Describe catalog/metastore integration, partition pruning, summary/materialized tables, and query optimizations.
  4. Scale & resilience: Scale the system to 1M+ events/second. Address backpressure handling, broker outages, consumer restarts, exactly-once vs. at-least-once trade-offs, failover and recovery, data reprocessing/backfills, and late/out-of-order events.
  5. Governance & cost: Cover schema evolution, PII governance, and cost optimization.
  6. Trade-offs: Compare real-time vs. batch processing and explain where each is appropriate.
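To make the storage item concrete, here is a minimal sketch of one possible S3 key layout partitioned by date, hour, and (optionally) campaign, using Hive-style `key=value` path segments. The bucket name, the `raw`/`curated` prefixes, the column names `dt`, `hour`, and `campaign_id`, and the helper `partition_path` are all assumptions made for illustration; the question does not prescribe them.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical bucket and zone prefixes, chosen only for illustration.
RAW_PREFIX = "s3://example-ads-lake/raw/clicks"
CURATED_PREFIX = "s3://example-ads-lake/curated/clicks"


def partition_path(prefix: str, event_time_ms: int,
                   campaign_id: Optional[str] = None) -> str:
    """Build a Hive-style partition path from an event time in epoch milliseconds.

    Partitions are derived from event time in UTC, so a late event is
    written to the hour it belongs to rather than the hour it arrived.
    """
    ts = datetime.fromtimestamp(event_time_ms / 1000, tz=timezone.utc)
    parts = [prefix, f"dt={ts:%Y-%m-%d}", f"hour={ts:%H}"]
    # The campaign segment is optional: it is only worth adding if the
    # number of distinct campaigns per hour stays manageable.
    if campaign_id is not None:
        parts.append(f"campaign_id={campaign_id}")
    return "/".join(parts) + "/"


if __name__ == "__main__":
    event_ms = int(datetime(2025, 7, 31, 13, 5, tzinfo=timezone.utc).timestamp() * 1000)
    print(partition_path(RAW_PREFIX, event_ms))
    print(partition_path(CURATED_PREFIX, event_ms, campaign_id="c42"))
```

A query filtered on `dt` and `hour` can then skip every path outside the requested range. Whether campaign belongs in the path is a judgment call: a high-cardinality partition column tends to produce many small files, so a candidate might reasonably keep it as a regular column in the curated files instead.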


Jul 31, 2025, 12:00 AM
