Design streaming mention analytics with search and alerts
Company: Bloomberg
Role: Software Engineer
Category: System Design
Difficulty: hard
Interview Round: Onsite
## Scenario
You ingest a real-time external stream of social-media posts and news articles. Each item contains raw text and metadata (timestamp, source, author/site, etc.). The product tracks companies/stocks ("entities") and shows:
1. **Mention analytics**: how many times each entity was mentioned over time (a per-entity count, analogous to an impressions metric).
2. **Charts by time window**: users can choose time spans from **30 minutes to multiple days** (tumbling or sliding windows are both acceptable).
3. **Latency**: charts may lag real time by **10–30 minutes**, but the data must be aggregated before it is displayed.
4. **Subscriptions & notifications**: users can follow a set of entities, filter analytics to followed entities, and configure alerts (e.g., spike in mentions).
5. **Search**:
   - Users can search across **hundreds of thousands of entities** (by company/stock name).
   - Users can also search for the **underlying documents** (posts/articles) that mention entities.
   - Search supports **any number of keywords** and filtering (e.g., entity, time range, source).
- Search load can be very high (e.g., **~100k RPS**).
6. **Spiky traffic**: must handle extreme bursts (breaking news, meme-stock events).
7. **Storage choices**: decide how to store both **processed/aggregated** data and **raw documents**.
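One way to reconcile requirements 2 and 3 is to pre-aggregate mentions into small fixed buckets and roll them up at query time into whatever window the user picks. The sketch below is a minimal in-memory illustration, not a prescribed design: the `MentionCounter` class, the one-minute bucket size, and the `"AAPL"` entity ID are all assumptions made for the example, and a real system would keep these counts in a stream processor's state and a time-series store.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

BUCKET_SECONDS = 60  # assumed base granularity: one-minute tumbling buckets


class MentionCounter:
    """Hypothetical in-memory stand-in for a pre-aggregated count store."""

    def __init__(self):
        # (entity_id, bucket_start_epoch_seconds) -> mention count
        self._counts = defaultdict(int)

    def record(self, entity_id, event_time):
        """Add one mention to the one-minute bucket containing event_time."""
        ts = int(event_time.timestamp())
        bucket = ts - ts % BUCKET_SECONDS
        self._counts[(entity_id, bucket)] += 1

    def series(self, entity_id, start, end, step):
        """Roll minute buckets up into step-sized points over [start, end).

        Assumes start and step are aligned to BUCKET_SECONDS.
        """
        step_s = int(step.total_seconds())
        start_s, end_s = int(start.timestamp()), int(end.timestamp())
        points = []
        for window_start in range(start_s, end_s, step_s):
            window_end = min(window_start + step_s, end_s)
            total = sum(
                self._counts.get((entity_id, bucket), 0)
                for bucket in range(window_start, window_end, BUCKET_SECONDS)
            )
            points.append((window_start, total))
        return points


counter = MentionCounter()
t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
for offset in (5, 42, 61, 1900):  # seconds after t0
    counter.record("AAPL", t0 + timedelta(seconds=offset))

points = counter.series("AAPL", t0, t0 + timedelta(hours=1), timedelta(minutes=30))
print([count for _, count in points])  # [3, 1]
```

Storing only the finest bucket keeps writes cheap during spikes. Multi-day charts would likely also need coarser rollups (hourly, daily) so that a query does not have to sum thousands of minute buckets.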
## Task
Design a high-level system (APIs, data flow, storage, and scaling strategy) that satisfies the above requirements. Clearly explain:
- How raw streaming data is ingested, processed, and aggregated.
- How time-windowed analytics are computed and served.
- How document/entity search works at high QPS.
- How subscriptions and alerting are implemented.
- How the system remains reliable and cost-effective under spikes.
State assumptions and key trade-offs (e.g., consistency, latency, storage format, retention).
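As a starting point for the alerting discussion, a "spike in mentions" rule can be sketched as a comparison of the latest window's count against a baseline built from recent windows. Everything here is an assumption for illustration: the function names, the z-score rule, the threshold of 3, and the minimum count of 20 are not part of the question, and an interviewer may expect a different rule (for example, a percentage increase configured per user).

```python
from statistics import mean, pstdev


def is_spike(history, current, z_threshold=3.0, min_count=20):
    """Return True if current is unusually high relative to history.

    history: mention counts for previous windows of the same size.
    current: mention count for the window being evaluated.
    """
    # Too little history, or too few mentions to be worth alerting on.
    if len(history) < 2 or current < min_count:
        return False
    baseline = mean(history)
    # Floor the spread at 1.0 so a perfectly flat history does not
    # divide by zero or turn tiny changes into huge z-scores.
    spread = max(pstdev(history), 1.0)
    return (current - baseline) / spread >= z_threshold


def users_to_notify(subscriptions, entity_id):
    """subscriptions maps user_id -> set of followed entity IDs."""
    return sorted(
        user for user, followed in subscriptions.items() if entity_id in followed
    )


history = [10, 12, 11, 9, 10, 13]
subscriptions = {"alice": {"GME", "AAPL"}, "bob": {"AAPL"}, "carol": {"GME"}}

if is_spike(history, current=60):
    print(users_to_notify(subscriptions, "GME"))  # ['alice', 'carol']
```

At scale, the subscription lookup would more plausibly be an inverted map from entity to followers, so that one spike does not require scanning every user.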
Quick Answer: This question evaluates system-design competencies such as real-time stream ingestion, stateful stream processing and time-windowed aggregation, high-throughput search indexing, alerting and subscription mechanisms, storage and retention strategy, and trade-offs around scalability, latency, consistency, and cost.
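To make the search-indexing part concrete, the following is a toy inverted index that supports multi-keyword AND queries plus entity, source, and time-range filters. It is a single-process sketch under stated assumptions: the `Doc` and `DocIndex` names, the field layout, and the simple regex tokenizer are hypothetical. Serving ~100k RPS would in practice rely on a sharded, replicated search engine with caching rather than anything like this.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: int
    text: str
    entities: frozenset  # entity IDs mentioned in the document
    source: str
    timestamp: int  # event time, epoch seconds


class DocIndex:
    """Hypothetical in-memory inverted index over document text."""

    def __init__(self):
        self._docs = {}
        self._postings = defaultdict(set)  # token -> set of doc_ids

    def add(self, doc):
        self._docs[doc.doc_id] = doc
        for token in re.findall(r"\w+", doc.text.lower()):
            self._postings[token].add(doc.doc_id)

    def search(self, keywords, entity=None, source=None, start=None, end=None):
        """Return docs containing all keywords and passing the filters, newest first."""
        if keywords:
            lists = sorted(
                (self._postings.get(k.lower(), set()) for k in keywords), key=len
            )
            # Start from the smallest posting list to keep intersections cheap.
            candidates = set(lists[0])
            for posting in lists[1:]:
                candidates &= posting
        else:
            candidates = set(self._docs)

        hits = []
        for doc_id in candidates:
            doc = self._docs[doc_id]
            if entity is not None and entity not in doc.entities:
                continue
            if source is not None and doc.source != source:
                continue
            if start is not None and doc.timestamp < start:
                continue
            if end is not None and doc.timestamp >= end:
                continue
            hits.append(doc)
        return sorted(hits, key=lambda d: d.timestamp, reverse=True)


index = DocIndex()
index.add(Doc(1, "Apple earnings beat expectations", frozenset({"AAPL"}), "news", 100))
index.add(Doc(2, "Apple supplier warns on earnings", frozenset({"AAPL"}), "social", 200))
index.add(Doc(3, "GameStop rallies again", frozenset({"GME"}), "social", 300))

print([d.doc_id for d in index.search(["apple", "earnings"])])  # [2, 1]
print([d.doc_id for d in index.search(["apple", "earnings"], source="news")])  # [1]
```

Entity search by company or stock name is a separate, much smaller index (hundreds of thousands of entries), which could reasonably be served from memory with prefix matching.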