## Scenario
You ingest a real-time external stream of social-media posts and news articles. Each item contains raw text and metadata (timestamp, source, author/site, etc.). The product tracks companies/stocks ("entities") and shows:
- **Mention analytics**: how many times each entity was mentioned over time (similar to impressions/mentions).
- **Charts by time window**: users can choose time spans from 30 minutes to multiple days (tumbling or sliding windows are both acceptable).
- **Latency**: charts may be delayed by 10–30 minutes, but data must be aggregated before display.
- **Subscriptions & notifications**: users can follow a set of entities, filter analytics to followed entities, and configure alerts (e.g., spike in mentions).
- **Search**:
  - Users can search across hundreds of thousands of entities (by company/stock name).
  - Users can also search for the underlying documents (posts/articles) that mention entities.
  - Search supports any number of keywords and filtering (e.g., entity, time range, source).
  - Search load can be very high (e.g., ~100k RPS).
- **Spiky traffic**: the system must handle extreme bursts (breaking news, meme-stock events).
- **Storage choices**: decide how to store both processed/aggregated data and raw documents.
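To make the windowed-analytics requirement concrete, here is a minimal sketch of counting mentions per entity in tumbling windows. The function names, the `(timestamp, entities)` item shape, and the 30-minute window size are illustrative assumptions, not part of the problem statement:

```python
from collections import defaultdict

# Assumed window size: the smallest chart span mentioned above (30 minutes).
WINDOW_SECONDS = 30 * 60


def window_start(ts: int, window: int = WINDOW_SECONDS) -> int:
    """Align a Unix timestamp to the start of its tumbling window."""
    return ts - (ts % window)


def aggregate_mentions(items):
    """items: iterable of (timestamp, [entity_ids]) pairs, e.g. drained
    from the stream. Returns {(window_start, entity_id): mention_count}."""
    counts = defaultdict(int)
    for ts, entities in items:
        w = window_start(ts)
        for entity in entities:
            counts[(w, entity)] += 1
    return dict(counts)


# Example: three posts, the first two landing in the same 30-minute window.
stream = [
    (1_700_000_000, ["AAPL", "TSLA"]),
    (1_700_000_100, ["AAPL"]),
    (1_700_002_000, ["TSLA"]),
]
agg = aggregate_mentions(stream)
```

In a real design this aggregation would run inside a stream processor rather than in memory, but the key-by-(window, entity) shape of the result is the same.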
## Task
Design a high-level system (APIs, data flow, storage, and scaling strategy) that satisfies the above requirements. Clearly explain:
- How raw streaming data is ingested, processed, and aggregated.
- How time-windowed analytics are computed and served.
- How document/entity search works at high QPS.
- How subscriptions and alerting are implemented.
- How the system remains reliable and cost-effective under spikes.
State assumptions and key trade-offs (e.g., consistency, latency, storage format, retention).
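As one example of the kind of assumption worth stating explicitly: "spike in mentions" needs a definition. A hedged sketch of one possible rule, comparing the current window to a trailing baseline (the factor, floor, and function name are illustrative choices, not given by the problem):

```python
def is_spike(current_count: int,
             trailing_counts: list[int],
             factor: float = 3.0,
             min_count: int = 10) -> bool:
    """Flag a spike when the current window's mention count exceeds both
    `factor` times the trailing-window average and an absolute floor
    (the floor avoids alerting on noise for rarely mentioned entities).
    All thresholds here are assumptions to be tuned, not requirements."""
    if not trailing_counts:
        # No history yet: fall back to the absolute floor alone.
        return current_count >= min_count
    baseline = sum(trailing_counts) / len(trailing_counts)
    return current_count >= min_count and current_count >= factor * baseline


# Example: a baseline of ~5 mentions/window vs. a current window of 40.
spiked = is_spike(40, [4, 5, 6])   # well above 3x the baseline
quiet = is_spike(8, [4, 5, 6])     # below the absolute floor
```

A design answer should state which definition it uses, since the choice drives both the alerting pipeline (how much trailing state to keep per entity) and the false-positive rate.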