Design a personalized news aggregation service
Company: Rippling
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Technical Screen
# System Design: Personalized News Aggregation Service
Design a large-scale news aggregation system similar to Google News or other news aggregator products.
The key functional requirements are:
- The system should **collect news articles** from many different news providers (e.g., CNN, BBC, local newspapers) using:
  - Web crawlers (for sites without APIs).
  - RSS feeds or publisher APIs when available.
- The system should **normalize and store** collected articles with consistent metadata:
  - Title, body, URL, publish time, source, author.
  - Category (e.g., politics, sports, tech) and language.
- The system should support **logged-in users** who have:
  - Subscriptions to specific publishers/sources.
  - Category/topic preferences (e.g., more sports, less politics).
- For each logged-in user, the system should display a **personalized news feed**, taking into account:
  - User’s subscriptions.
  - User’s category/topic preferences.
  - Freshness and popularity of articles.
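
To make the requirements above concrete, the normalized article record and the per-user preferences could be modeled roughly as follows. This is a minimal sketch, not a prescribed schema: the class names, field names, and the choice to derive an ID by hashing the URL are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from hashlib import sha256


@dataclass(frozen=True)
class Article:
    """Normalized article record (hypothetical field names)."""
    title: str
    body: str
    url: str
    published_at: datetime
    source: str
    author: str
    category: str   # e.g. "politics", "sports", "tech"
    language: str   # e.g. "en"

    @property
    def article_id(self) -> str:
        # Assumption: the URL identifies an article, so a truncated hash of it
        # serves as a fixed-width key. Real systems would canonicalize the URL first.
        return sha256(self.url.encode("utf-8")).hexdigest()[:16]


@dataclass
class UserPreferences:
    """Per-user subscriptions and category preferences (hypothetical shape)."""
    user_id: str
    subscribed_sources: set[str] = field(default_factory=set)
    # Assumption: a weight above 1.0 means "show more", below 1.0 means "show less".
    category_weights: dict[str, float] = field(default_factory=dict)


article = Article(
    title="Example headline",
    body="Example body text.",
    url="https://example.com/news/1",
    published_at=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    source="example-source",
    author="A. Writer",
    category="tech",
    language="en",
)
prefs = UserPreferences(user_id="u1")
prefs.subscribed_sources.add("example-source")
prefs.category_weights["sports"] = 1.5
```

Keeping the article record immutable (`frozen=True`) reflects that ingestion writes an article once, while preferences change as the user edits them.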
Non-functional requirements and constraints (you may make reasonable assumptions, but be explicit):
- Large scale: potentially tens of millions of daily active users.
- High read throughput: most requests are for reading the news feed.
- Reasonable freshness: new articles should appear in user feeds within a few minutes of being published.
- High availability and low latency for feed retrieval (e.g., p95 < 200–300 ms).
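
Since the prompt asks for explicit assumptions, a quick back-of-the-envelope estimate of feed read traffic is a useful starting point. The numbers below (30 million daily active users, 10 feed requests per user per day, a 3x peak-to-average factor) are illustrative assumptions, not figures given in the question.

```python
SECONDS_PER_DAY = 86_400


def feed_qps(dau: int, requests_per_user_per_day: float, peak_factor: float) -> tuple[float, float]:
    """Return (average QPS, peak QPS) for feed reads under the stated assumptions."""
    average = dau * requests_per_user_per_day / SECONDS_PER_DAY
    return average, average * peak_factor


# Assumed inputs: 30M DAU, 10 feed fetches per user per day, 3x peak factor.
avg, peak = feed_qps(dau=30_000_000, requests_per_user_per_day=10, peak_factor=3.0)
print(f"average ≈ {avg:,.0f} QPS, peak ≈ {peak:,.0f} QPS")
# prints: average ≈ 3,472 QPS, peak ≈ 10,417 QPS
```

Under these assumptions the feed path has to sustain on the order of 10,000 reads per second at peak, which motivates the caching and replication discussion later in the design.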
In your design, cover at least the following aspects:
1. **Requirements and APIs**
   - Clarify functional and non-functional requirements.
   - Define main APIs or endpoints for clients (web/mobile) to fetch the news feed and manage preferences.
2. **High-level Architecture**
   - Major components and services (e.g., crawler, content ingestion pipeline, storage, feed/personalization service).
   - How data flows from publishers to the end-user feed.
3. **Data Storage and Indexing**
   - How you will store articles, metadata, and user preferences.
   - How to support efficient querying (by category, recency, popularity, user interests).
4. **Crawling & Ingestion Pipeline**
   - How crawlers/RSS/API consumers are scheduled and scaled.
   - How content is parsed, deduplicated, categorized, and filtered.
5. **Personalization & Ranking**
   - How to build a personalized feed based on user subscriptions and category preferences.
   - Basic ranking logic (you can assume heuristic or ML-based ranking, but describe the approach conceptually).
6. **Scalability, Caching, and Availability**
   - Strategies to handle high read traffic and keep latency low.
   - Use of caching, CDNs, sharding, and replication.
7. **Freshness, Consistency, and Trade-offs**
   - How to balance freshness of news with system load and cache efficiency.
   - Any relevant consistency or CAP-theorem trade-offs you would make.
Explain your design step-by-step and justify key trade-offs.
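
As an illustration of what "basic ranking logic" might look like, here is one possible heuristic that combines the four signals named in the requirements: subscriptions, category preferences, freshness, and popularity. It is a sketch under assumptions, not the expected answer: the function name, the 6-hour half-life, the 2x subscription boost, and the use of view counts as the popularity signal are all hypothetical choices.

```python
import math
from datetime import datetime, timedelta, timezone

HALF_LIFE_HOURS = 6.0       # assumption: freshness halves every 6 hours
SUBSCRIPTION_BOOST = 2.0    # assumption: subscribed sources score 2x


def score_article(
    published_at: datetime,
    views: int,
    source: str,
    category: str,
    subscribed_sources: set[str],
    category_weights: dict[str, float],
    now: datetime,
) -> float:
    """Heuristic score: exponential time decay times popularity and preference boosts."""
    age_hours = max((now - published_at).total_seconds() / 3600.0, 0.0)
    freshness = 0.5 ** (age_hours / HALF_LIFE_HOURS)
    popularity = 1.0 + math.log1p(views)  # log damping so viral articles do not dominate
    boost = SUBSCRIPTION_BOOST if source in subscribed_sources else 1.0
    weight = category_weights.get(category, 1.0)  # neutral weight for unknown categories
    return freshness * popularity * boost * weight


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh_subscribed = score_article(
    now - timedelta(hours=1), 100, "bbc", "sports", {"bbc"}, {"sports": 1.5}, now
)
old_unsubscribed = score_article(
    now - timedelta(hours=24), 100, "cnn", "politics", {"bbc"}, {"politics": 0.5}, now
)
# The fresh article from a subscribed source in a preferred category ranks higher.
```

A multiplicative form keeps each signal independently tunable; an ML-based ranker could later replace this function while the candidate generation around it stays the same.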