This question evaluates a Data Scientist's competency in designing real-time streaming architectures, covering event-time semantics, stateful stream processing, fault-tolerant checkpointing, and partitioning and data modeling across Kafka, Apache Flink, and downstream warehouse or lakehouse systems.
You need to design a real-time pipeline that ingests website click events via Kafka, processes them using Apache Flink, and writes queryable aggregates to a data warehouse or lakehouse for downstream analytics.
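Before Flink ever sees an event, the Kafka topic's partitioning key shapes what the pipeline can compute: keying by user lets per-user state (unique users, funnels) stay local to one partition. The sketch below illustrates that idea in plain Python; the field names (`user_id`, `url`, `ts_ms`) and the partition count are illustrative assumptions, not part of the question.

```python
import hashlib

NUM_PARTITIONS = 12  # hypothetical partition count for the clicks topic

def partition_for(event: dict) -> int:
    """Route a click event to a Kafka partition by user_id, so all of a
    user's events land on one partition and preserve per-user ordering
    (useful for sessionization and funnel analysis)."""
    key = event["user_id"].encode("utf-8")
    # Use a stable hash; Python's builtin hash() is salted per process.
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

event = {"user_id": "u-42", "url": "/pricing", "ts_ms": 1_700_000_000_000}
assert 0 <= partition_for(event) < NUM_PARTITIONS
```

Keying by `user_id` trades balanced load (a hot user skews one partition) for locality of per-user state; keying by URL would make the opposite trade for per-URL counters.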
Assume the business wants near real-time (<1 minute) aggregate metrics (e.g., page views per URL, unique users, funnels) with correctness guarantees suitable for business-critical decision-making. Click events are append-only and can arrive out of order.
Describe the end-to-end design, addressing:
- Event-time semantics, watermarks, and handling of late or out-of-order events
- Stateful stream processing and windowed aggregation in Flink
- Fault-tolerant checkpointing and delivery guarantees
- Partitioning and data modeling across Kafka topics and the downstream warehouse or lakehouse

Keep the design practical and call out trade-offs and key configuration choices.