Design ad clickstream analytics pipeline

Q: Design ad clickstream analytics pipeline

This question evaluates a candidate's ability to design large-scale streaming and data lake architectures, testing competencies in Kafka-based ingestion, S3 data layout and partitioning, real-time and batch analytics (CTR and dashboards), reliability and failure recovery, schema management, governance, and cost trade-offs.

Q: How do I approach System Design interview questions?

System Design questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master system design interviews.

Question

Design an End-to-End Advertising Clickstream Analytics System

Context

You are designing a large-scale clickstream platform for an ad system. The platform must ingest impression and click events via Kafka, land both raw and curated data to S3, and support interactive analytics with Presto. It must deliver real-time campaign metrics (e.g., CTR within 1 minute) and batch ETL to S3 for ad-hoc exploration. Target scale is 1M+ events per second.

Requirements

Kafka Ingestion
- Topic partitioning strategy (keys, partition count, skew mitigation).
- Message schema and serialization with schema management.
- Ordering and delivery semantics (idempotence, transactions, acks).
- Consumer groups for multiple use cases: real-time metrics (CTR in ≤1 minute) and batch ETL to S3.
S3 Data Lake
- Raw and curated zones; layout and partitioning for efficient Presto queries (e.g., by date/hour/campaign).
- Catalog and schema management for Presto.
- File format and compaction strategy to avoid small files.
Analytics and Processing
- Real-time pipeline for CTR and other dashboards.
- Batch ETL to curated S3 tables for interactive Presto queries.
Reliability and Scale
- Scaling to 1M+ events/second; backpressure handling.
- Exactly-once vs at-least-once trade-offs and configurations.
- Failure scenarios and recovery (broker outages, consumer restarts).
- Data reprocessing and schema evolution.
Governance and Cost
- PII governance (separation, encryption, access control, deletion).
- Cost optimization strategies.
Trade-offs
- Compare real-time vs batch processing and when each is appropriate.

Design ad clickstream analytics pipeline

Quick Overview

Design an End-to-End Advertising Clickstream Analytics System

Context

Requirements

Solution

Comments (0)