Scenario
You are designing a data platform to measure advertising performance.
Mobile apps and web browsers send ad impression and ad click events. Analysts need near-real-time dashboards and batch reports.
Requirements
- Ingest impression/click events from mobile and web clients.
- Produce aggregates such as:
  - clicks / impressions / CTR
  - grouped by time window (e.g., 1 min, 1 hour, 1 day)
  - grouped by dimensions like campaign_id, ad_id, publisher_id, country, device_type
- Enrich events by joining with other data sources (examples):
  - campaign metadata (budget, objective)
  - ad metadata (creative type)
  - user/device attributes (coarse geo, OS)
- Support both:
  - near-real-time queries (seconds to a few minutes delay)
  - historical queries over months
- Event delivery constraints:
  - clients may be offline and retry
  - duplicate/out-of-order events can occur
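
To make the aggregation and delivery constraints concrete, here is a minimal in-memory sketch of deduplicated, event-time CTR aggregation. It assumes each event carries a client-generated `event_id` that is reused on retry; that field, the `Event` class, and the `aggregate_ctr` function are hypothetical names for illustration and are not part of the prompt. A real pipeline would keep this state in a stream processor with bounded, expiring dedup state rather than an unbounded Python set.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str     # assumed: unique per event, reused when the client retries
    event_type: str   # "impression" or "click"
    event_time: int   # client-side event time, epoch seconds
    campaign_id: str


def aggregate_ctr(events, window_seconds=60):
    """Count deduplicated events per (window_start, campaign_id) tumbling window."""
    seen_ids = set()
    counts = defaultdict(lambda: {"impression": 0, "click": 0})
    for event in events:
        if event.event_id in seen_ids:
            continue  # duplicate from a retry: already counted once
        seen_ids.add(event.event_id)
        # Bucket by event time, so arrival order does not affect the result.
        window_start = event.event_time - event.event_time % window_seconds
        counts[(window_start, event.campaign_id)][event.event_type] += 1

    report = {}
    for key, c in counts.items():
        # CTR is undefined for a window with clicks but no impressions.
        ctr = c["click"] / c["impression"] if c["impression"] else None
        report[key] = {"impressions": c["impression"], "clicks": c["click"], "ctr": ctr}
    return report


events = [
    Event("e1", "impression", 125, "c1"),
    Event("e2", "impression", 130, "c1"),
    Event("e3", "click", 131, "c1"),
    Event("e3", "click", 131, "c1"),      # duplicate delivery of e3
    Event("e4", "impression", 61, "c1"),  # arrives late, belongs to an earlier window
]
report = aggregate_ctr(events)
```

In this example the window starting at 120 ends up with 2 impressions, 1 click, and a CTR of 0.5, while the late event `e4` lands in the window starting at 60. Keeping per-window counts (not only the ratio) is one way to keep results explainable, since coarser windows can be rebuilt by summing counts.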
Scale & SLOs (assume)
- Peak 500k events/sec (impressions + clicks), average 100k/sec.
- Dashboard freshness: P95 < 2 minutes.
- Correctness: exactly-once is not required, but duplicates should be minimized and results should be explainable.
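
A quick back-of-envelope calculation helps size ingest and storage from these numbers. The event rates come from the assumptions above; the 200-byte average event size is purely an illustrative assumption and should be replaced with a measured value.

```python
PEAK_EVENTS_PER_SEC = 500_000   # from the stated assumptions
AVG_EVENTS_PER_SEC = 100_000    # from the stated assumptions
ASSUMED_EVENT_BYTES = 200       # assumption for illustration, not given in the prompt
SECONDS_PER_DAY = 86_400

# Ingest bandwidth in MB/s (1 MB = 10**6 bytes), before compression or replication.
peak_mb_per_sec = PEAK_EVENTS_PER_SEC * ASSUMED_EVENT_BYTES / 1e6
avg_mb_per_sec = AVG_EVENTS_PER_SEC * ASSUMED_EVENT_BYTES / 1e6

# Daily volume at the average rate.
events_per_day = AVG_EVENTS_PER_SEC * SECONDS_PER_DAY
raw_tb_per_day = events_per_day * ASSUMED_EVENT_BYTES / 1e12

print(f"peak ingest: {peak_mb_per_sec:.0f} MB/s, average: {avg_mb_per_sec:.0f} MB/s")
print(f"events/day: {events_per_day:,}, raw volume: {raw_tb_per_day:.3f} TB/day")
```

Under that size assumption the peak is about 100 MB/s and the raw volume is roughly 1.7 TB per day, so months of history reach hundreds of terabytes before compression and replication.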
Key discussion prompt
Clients can send events:
- one request per event, or
- batch multiple events per request.
Explain the trade-off between the number of requests and latency, especially on mobile networks.
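
One common way to reason about this trade-off is a client-side batcher that flushes when either a size limit or a time limit is reached. The sketch below is an illustration only: `EventBatcher`, `max_batch_size`, and `max_wait_seconds` are hypothetical names, and the clock is passed in explicitly so the behavior is easy to follow. Larger batches mean fewer requests, while `max_wait_seconds` caps how long an event can sit on the device before it is sent.

```python
class EventBatcher:
    """Buffers events and releases a batch when it is full or has waited too long."""

    def __init__(self, max_batch_size=20, max_wait_seconds=5.0):
        self.max_batch_size = max_batch_size
        self.max_wait_seconds = max_wait_seconds
        self._buffer = []
        self._first_buffered_at = None  # time the oldest buffered event was added

    def add(self, event, now):
        """Buffer one event; return a batch to send if a flush is due, else None."""
        if not self._buffer:
            self._first_buffered_at = now
        self._buffer.append(event)
        return self._flush_if_due(now)

    def tick(self, now):
        """Call periodically so a partly filled batch is still sent on time."""
        return self._flush_if_due(now)

    def _flush_if_due(self, now):
        if not self._buffer:
            return None
        full = len(self._buffer) >= self.max_batch_size
        waited = now - self._first_buffered_at >= self.max_wait_seconds
        if full or waited:
            batch, self._buffer = self._buffer, []
            self._first_buffered_at = None
            return batch
        return None


batcher = EventBatcher(max_batch_size=3, max_wait_seconds=5.0)
sent = []
for t, name in [(0.0, "a"), (1.0, "b"), (2.0, "c"), (3.0, "d")]:
    batch = batcher.add(name, now=t)
    if batch:
        sent.append(batch)          # size-triggered flush: one request for three events
too_early = batcher.tick(now=7.0)   # "d" has waited only 4 s, nothing is sent
late_flush = batcher.tick(now=8.0)  # "d" has waited 5 s, time-triggered flush
```

Here four events leave the device in two requests instead of four, at the cost of up to 5 seconds of added delay per event. That delay counts against the dashboard freshness budget, so the time limit is usually the more important knob on mobile.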
Deliverables
- High-level architecture and major components
- Data model / schemas
- How you do enrichment joins (stream-stream vs stream-table vs batch)
- How you handle deduplication, late events, and backfills
- What you store for serving (OLAP/warehouse) and for near-real-time dashboards