This question evaluates a candidate's understanding of distributed systems and data engineering concepts including streaming ingestion (Kafka), object storage and data lake layout (S3), query-engine integration (Presto), scalability, fault-tolerance, schema evolution, deduplication, and analytics requirements for high-volume ad clickstream data.
You are asked to design an end-to-end advertisement clickstream platform that ingests events from web and mobile clients, persists raw data durably, and supports interactive analytics. The stack must use Kafka for ingestion, S3 for raw storage, and Presto for queries. The system should scale to heavy traffic, tolerate failures, and balance real-time and batch needs.
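As a starting point for discussion, here is a minimal Python sketch of one piece of the Kafka-to-S3 path: dropping duplicate events within a consumed batch and grouping the rest under Hive-style partition prefixes, a layout that Presto's Hive connector can read as partitions. Everything specific here is an assumption for illustration and not part of the question: the event fields (`event_id`, `event_ts` as epoch seconds, `ad_id`), the `raw/clicks` prefix, and the `dt=`/`hour=` partition names. It deliberately omits the Kafka consumer and the S3 upload.

```python
from collections import defaultdict
from datetime import datetime, timezone


def dedupe(events):
    """Keep the first occurrence of each event_id within one batch.

    Assumption: producers attach a unique event_id. This handles only
    in-batch duplicates (e.g. retried sends), not cross-batch ones.
    """
    seen = set()
    unique = []
    for event in events:
        if event["event_id"] in seen:
            continue
        seen.add(event["event_id"])
        unique.append(event)
    return unique


def partition_prefix(event, prefix="raw/clicks"):
    """Build a Hive-style S3 key prefix from the event time (UTC)."""
    ts = datetime.fromtimestamp(event["event_ts"], tz=timezone.utc)
    return f"{prefix}/dt={ts:%Y-%m-%d}/hour={ts:%H}"


def group_by_partition(events):
    """Group events by partition prefix so each group can be written
    as one larger object instead of one object per event."""
    groups = defaultdict(list)
    for event in events:
        groups[partition_prefix(event)].append(event)
    return dict(groups)


if __name__ == "__main__":
    batch = [
        {"event_id": "a1", "event_ts": 0, "ad_id": "ad-7"},
        {"event_id": "a1", "event_ts": 0, "ad_id": "ad-7"},  # retried send
        {"event_id": "b2", "event_ts": 3600, "ad_id": "ad-9"},
    ]
    for prefix, group in group_by_partition(dedupe(batch)).items():
        print(prefix, len(group))
```

Partitioning on event time rather than arrival time keeps late events in the hour they belong to, at the cost of writing into older partitions; which trade-off is right depends on the latency and correctness requirements the design is asked to meet.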
Assume the following to make the problem concrete:
Design the platform and cover: