What difficulty level is this interview question?

This is a medium difficulty System Design question, commonly asked during Technical Screen rounds at Rippling.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Rippling during technical interviews.

Design a News Aggregation System (Google News-style)

This question assesses a candidate's ability to design a large-scale news aggregation system, covering distributed crawling, article deduplication, event clustering, and real-time trending detection. It is a common system design interview topic used to evaluate architectural reasoning about data pipelines, freshness, consistency, and latency trade-offs at scale.

Design a large-scale news aggregation service similar to Google News. The system continuously discovers and crawls news articles from tens of thousands of publishers across the web, groups articles that cover the same real-world event into a single story, surfaces trending / hot topics in near real time, and serves a ranked, deduplicated news feed to readers.

The interview focuses less on classic full-text search and more on the ingestion and processing pipeline — crawling at scale, coordinating a fleet of crawler machines and keeping it alive when nodes fail, deduplicating and clustering articles into stories, detecting hot topics, and choosing the right storage for each dataset — plus how those choices affect freshness, consistency, and latency.

Constraints & Assumptions

Sources: ~50,000–100,000 news sources / feeds (RSS, sitemaps, publisher APIs, and raw HTML).
Ingest volume: ~2–5 million new articles per day (~23–58 articles/sec average; 5–10x bursts during major breaking events).
Readers: ~50M daily active users, each loading a feed a few times a day → tens of thousands of feed reads/sec at peak.
Freshness: breaking news should appear in a reader's feed within ~2–5 minutes of publication.
Latency: p95 feed read < ~200 ms.
Politeness: respect robots.txt and per-domain crawl-rate limits.
Storage: retain the article corpus for years (archive + serve recent feed).
Assume we store article metadata + a snippet/link (publisher-hosted content), not full reproductions.
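
The numbers above can be sanity-checked with a quick back-of-envelope calculation. The sketch below only restates the constraints as arithmetic; the "3 to 5 feed loads per user per day" reading of "a few times a day" and the 10x peak-to-average factor are assumptions for illustration, not figures from the prompt.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(events_per_day: float) -> float:
    """Average events per second for a given daily volume."""
    return events_per_day / SECONDS_PER_DAY


# Ingest: 2-5 million new articles per day (from the constraints).
ingest_low = per_second(2_000_000)   # about 23 articles/sec
ingest_high = per_second(5_000_000)  # about 58 articles/sec

# Breaking-news bursts of 5-10x the average rate.
burst_low = ingest_low * 5     # about 116 articles/sec
burst_high = ingest_high * 10  # about 579 articles/sec

# Reads: 50M daily active users; 3-5 loads per user per day is an assumption.
reads_low = per_second(50_000_000 * 3)   # about 1,736 reads/sec
reads_high = per_second(50_000_000 * 5)  # about 2,894 reads/sec

# Assumed peak-to-average ratio (hypothetical; real traffic curves vary).
PEAK_FACTOR = 10
peak_low = reads_low * PEAK_FACTOR    # about 17,400 reads/sec
peak_high = reads_high * PEAK_FACTOR  # about 28,900 reads/sec

if __name__ == "__main__":
    print(f"ingest avg: {ingest_low:.0f}-{ingest_high:.0f} articles/sec")
    print(f"ingest burst: {burst_low:.0f}-{burst_high:.0f} articles/sec")
    print(f"reads peak: {peak_low:,.0f}-{peak_high:,.0f} reads/sec")
```

Under these assumptions the read path outweighs the write path by roughly two orders of magnitude, which is why the serving side leans on precomputation and caching while ingestion mostly needs burst headroom.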

Clarifying Questions to Ask

Scope: global multi-language and multi-region, or a single language/region to start?
Do we need personalized feeds (per-user ranking) or a single editorial/region feed in v1?
Is full-text search in scope, or only the browse/feed + trending experience?
Do we host article text or only link out with a snippet (this changes storage and copyright handling)?
How fresh must trending be — sub-minute, or a few minutes acceptable?
Multimedia (images/video) in scope, or text articles only for now?

Part 1 — Distributed crawling & fleet coordination

Design the crawl/ingestion subsystem that discovers and fetches new articles from tens of thousands of sources with low latency, while respecting per-domain politeness. The fleet consists of many worker machines that must divide the source list among themselves. Explain how work is partitioned, how you avoid two workers fetching the same source, how you schedule recrawls, and, critically, how the fleet stays coordinated when the coordinator node or a worker dies (leader election, membership, failure detection, rebalancing).
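
One common way to partition sources so that exactly one worker owns each domain, and so that a worker failure moves only that worker's domains, is consistent hashing. The following is a minimal sketch of that idea, not the expected answer; `HashRing`, `stable_hash`, the worker names, and the choice of 64 virtual nodes are all hypothetical. Membership and failure detection (for example, heartbeats into a coordination service) are assumed to exist elsewhere and simply call `add` and `remove`.

```python
import bisect
import hashlib


def stable_hash(key: str) -> int:
    """64-bit hash that is stable across processes (built-in hash() is salted)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring mapping source domains to crawler workers."""

    def __init__(self, workers, vnodes: int = 64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (ring_point, worker_id)
        for worker in workers:
            self.add(worker)

    def add(self, worker: str) -> None:
        # Each worker gets several virtual nodes to even out the load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (stable_hash(f"{worker}#{i}"), worker))

    def remove(self, worker: str) -> None:
        # Called when failure detection declares the worker dead.
        self._ring = [(p, w) for p, w in self._ring if w != worker]

    def owner(self, domain: str) -> str:
        if not self._ring:
            raise RuntimeError("no live workers in the ring")
        # First ring point clockwise from the domain's hash, wrapping around.
        idx = bisect.bisect_left(self._ring, (stable_hash(domain),))
        return self._ring[idx % len(self._ring)][1]


if __name__ == "__main__":
    domains = [f"publisher{i}.example" for i in range(1000)]
    ring = HashRing(["w1", "w2", "w3", "w4"])
    before = {d: ring.owner(d) for d in domains}
    ring.remove("w3")  # simulate a worker death
    after = {d: ring.owner(d) for d in domains}
    moved = [d for d in domains if before[d] != after[d]]
    print(f"{len(moved)} of {len(domains)} domains were reassigned")
```

Keying the ring on the domain rather than the individual feed URL keeps all fetches for one publisher on one worker, so the per-domain rate limit can be enforced locally without cross-worker coordination.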

Clarifying Questions for this Part

Can we consume push/pull feeds (RSS, sitemaps, PubSubHubbub / WebSub) for most sources, or must we crawl raw HTML and discover article URLs ourselves?
Is there an SLA difference between "tier-1" high-value sources (crawl every few seconds) and the long tail (crawl hourly)?
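
If the answer to the tiering question is yes, recrawl scheduling can be sketched as a min-heap keyed on each source's next due time, with the interval chosen per tier. This is an illustrative sketch only; `RecrawlScheduler`, the tier names, and the 10-second and 1-hour intervals are assumptions standing in for "every few seconds" and "hourly".

```python
import heapq
from dataclasses import dataclass, field

# Hypothetical recrawl intervals per tier, in seconds.
TIER_INTERVAL = {"tier1": 10, "long_tail": 3600}


@dataclass(order=True)
class Due:
    at: float                            # next time the source should be fetched
    source: str = field(compare=False)
    tier: str = field(compare=False)


class RecrawlScheduler:
    """Min-heap of sources ordered by next due time."""

    def __init__(self):
        self._heap = []

    def add(self, source: str, tier: str, now: float) -> None:
        # New sources are due immediately.
        heapq.heappush(self._heap, Due(now, source, tier))

    def pop_due(self, now: float) -> list:
        """Return every source that is due at `now` and reschedule each one."""
        due = []
        while self._heap and self._heap[0].at <= now:
            item = heapq.heappop(self._heap)
            due.append(item.source)
            next_at = now + TIER_INTERVAL[item.tier]
            heapq.heappush(self._heap, Due(next_at, item.source, item.tier))
        return due


if __name__ == "__main__":
    sched = RecrawlScheduler()
    sched.add("wire.example", "tier1", now=0)
    sched.add("blog.example", "long_tail", now=0)
    for t in (0, 5, 10, 3600):
        print(t, sched.pop_due(t))
```

A production scheduler would likely adapt the interval to each source's observed publishing rate and skip polling for sources that push updates; the fixed table here only illustrates the tiering.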

Part 2 — Processing: dedup, clustering into stories, and embeddings

Once raw articles arrive, design the pipeline that normalizes them, detects near-duplicates across syndicating sources, and clusters articles covering the same event into one story. The pipeline computes vector embeddings used for similarity. Address the trade-off the interviewer probed directly: should you persist the article to the database first and then trigger embedding asynchronously, or compute the embedding inline before the write? Reason about consistency, latency, durability, and accuracy.
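
For the near-duplicate step specifically, a cheap lexical check can catch syndicated copies before any embedding is computed. The sketch below uses word shingles and exact Jaccard similarity; it is one possible approach under stated assumptions, not the method the prompt prescribes. The function names, the shingle size of 3, and the 0.8 threshold are hypothetical, and at the stated scale an approximation such as MinHash with locality-sensitive hashing would typically replace the pairwise comparison.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Set of k-word shingles after lowercasing and stripping punctuation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A and B| / |A or B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a: str, text_b: str, threshold: float = 0.8) -> bool:
    """True when the shingle overlap meets the (assumed) threshold."""
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold


if __name__ == "__main__":
    original = "Central bank raises interest rates by a quarter point amid inflation fears"
    syndicated = "Central Bank raises interest rates by a quarter point, amid inflation fears. Updated"
    unrelated = "Local team wins championship after dramatic overtime finish"
    print(is_near_duplicate(original, syndicated))  # True (10 of 11 shingles shared)
    print(is_near_duplicate(original, unrelated))   # False (no shingles shared)
```

Near-duplicate detection and story clustering are different problems: two independently written articles about the same event share few shingles, which is where embedding similarity comes in.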

Part 3 — Hot / trending topic detection

Design how the system detects hot / trending stories in near real time, for example a breaking event whose coverage and reader interest suddenly spike. Define what makes a topic "hot" and how you compute it at scale.
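
One possible definition of "hot" is a story whose activity in the current time window is several times its own recent baseline, with a minimum volume so that tiny stories do not trigger on noise. The sketch below illustrates that definition with fixed one-minute buckets; `TrendDetector` and every threshold in it are assumptions for illustration. A real deployment would compute this in a stream processor and expire old buckets, which this sketch omits.

```python
from collections import defaultdict


class TrendDetector:
    """Per-story windowed counts with a spike test against the recent baseline."""

    def __init__(self, window_secs=60, history=30, min_count=20, ratio=3.0):
        self.window_secs = window_secs  # bucket width
        self.history = history          # number of past buckets in the baseline
        self.min_count = min_count      # volume floor for the current bucket
        self.ratio = ratio              # required current/baseline ratio
        self._counts = defaultdict(lambda: defaultdict(int))  # story -> bucket -> count

    def record(self, story_id: str, ts: float, weight: int = 1) -> None:
        """Count one event (new article, click, share) for a story."""
        self._counts[story_id][int(ts // self.window_secs)] += weight

    def _current_and_baseline(self, story_id: str, now: float):
        current = int(now // self.window_secs)
        buckets = self._counts.get(story_id, {})
        cur = buckets.get(current, 0)
        past = [buckets.get(current - i, 0) for i in range(1, self.history + 1)]
        return cur, sum(past) / self.history

    def score(self, story_id: str, now: float) -> float:
        cur, baseline = self._current_and_baseline(story_id, now)
        # +1 smoothing avoids division by zero for brand-new stories.
        return cur / (baseline + 1.0)

    def is_hot(self, story_id: str, now: float) -> bool:
        cur, _ = self._current_and_baseline(story_id, now)
        return cur >= self.min_count and self.score(story_id, now) >= self.ratio


if __name__ == "__main__":
    det = TrendDetector(window_secs=60, history=5, min_count=10, ratio=3.0)
    for w in range(6):                       # steady story: 4 events per minute
        det.record("steady", w * 60, weight=4)
    for w in range(5):                       # quiet story: 2 events per minute...
        det.record("breaking", w * 60, weight=2)
    det.record("breaking", 5 * 60, weight=50)  # ...then a sudden burst
    print(det.is_hot("steady", 330), det.is_hot("breaking", 330))  # False True
```

Comparing each story against its own baseline, rather than ranking by raw counts, keeps perennially popular topics from crowding out genuinely new spikes.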

Part 4 — Storage choices, data model, and serving

Choose storage for the major datasets: raw articles & metadata, story clusters, embeddings, engagement/trending counters, and the served feed. The interviewer rejected the answer "both a relational and a document DB would work" and asked the candidate to pick one and justify it. Make concrete per-dataset choices and justify them by access pattern. Then sketch the read/serving path that builds a deduplicated, ranked feed with low latency, and note where full-text/vector search would fit if added.
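
The deduplicated, ranked feed on the serving path can be sketched as "one representative article per story, ordered by score". The example below is a minimal illustration under assumptions: the `Article` fields, the `source_quality` signal, and the 6-hour recency half-life are hypothetical, and in practice the result would be precomputed and cached rather than built per request.

```python
from dataclasses import dataclass


@dataclass
class Article:
    article_id: str
    story_id: str          # cluster assigned by the processing pipeline
    source_quality: float  # hypothetical 0..1 publisher quality signal
    published_at: float    # epoch seconds


def rank_score(article: Article, now: float, half_life_secs: float = 6 * 3600) -> float:
    """Quality discounted by age: the score halves every `half_life_secs`."""
    age = max(0.0, now - article.published_at)
    return article.source_quality * 0.5 ** (age / half_life_secs)


def build_feed(articles, now: float, limit: int = 20) -> list:
    """Keep the best-scoring article per story, then rank stories by that score."""
    best = {}  # story_id -> (score, article)
    for article in articles:
        score = rank_score(article, now)
        if article.story_id not in best or score > best[article.story_id][0]:
            best[article.story_id] = (score, article)
    ranked = sorted(best.values(), key=lambda pair: pair[0], reverse=True)
    return [article for _, article in ranked[:limit]]


if __name__ == "__main__":
    now = 100_000.0
    candidates = [
        Article("a1", "s1", 0.9, now - 3600),      # strong source, 1 hour old
        Article("a2", "s1", 0.5, now - 60),        # same story, weaker source
        Article("a3", "s2", 0.8, now - 6 * 3600),  # different story, 6 hours old
    ]
    print([a.article_id for a in build_feed(candidates, now)])  # ['a1', 'a3']
```

Because the feed needs only the top few stories per region or topic, the ranked list fits naturally in a cache that is refreshed when a story's membership or score changes.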

Follow-up Questions

How do you guarantee exactly-once-ish processing from crawl to store given retries and worker crashes (idempotency keys, dedup, outbox)?
A single major event multiplies article volume 100x within minutes. Where does the pipeline break first, and how do you protect it (backpressure, autoscaling, load shedding, priority queues)?
How would you cluster the same event across languages to add multilingual support?
How do you keep each reader's feed fresh without recomputing ranking on every request (incremental updates, cache invalidation when a new high-rank article lands)?
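
For the first follow-up, one building block is an idempotency key derived from the article's canonical URL, so that a retried or duplicated crawl result becomes a no-op at the store. The sketch below uses an in-memory dict as a stand-in for a table with a unique constraint; `idempotency_key` and `ArticleStore` are hypothetical names, and dropping the query string during canonicalization is an assumption that would be wrong for publishers that identify articles by query parameter.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def idempotency_key(url: str) -> str:
    """SHA-256 of a canonicalized URL (lowercased host, no query or fragment)."""
    parts = urlsplit(url.strip())
    canonical = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path.rstrip("/"), "", "")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ArticleStore:
    """In-memory stand-in for a table with a unique constraint on the key."""

    def __init__(self):
        self.rows = {}

    def insert_if_absent(self, url: str, title: str) -> bool:
        """Insert once per key; return False when the article was already stored."""
        key = idempotency_key(url)
        if key in self.rows:
            return False  # retry or duplicate delivery: safe to acknowledge
        self.rows[key] = {"url": url, "title": title}
        return True


if __name__ == "__main__":
    store = ArticleStore()
    first = store.insert_if_absent("https://news.example/story/42", "Rates rise")
    # Same article redelivered after a worker crash, with tracking noise in the URL.
    retry = store.insert_if_absent("HTTPS://News.Example/story/42/?utm_source=feed", "Rates rise")
    print(first, retry, len(store.rows))  # True False 1
```

With at-least-once delivery from the queue and an idempotent write like this, the pipeline behaves as effectively-once without requiring distributed transactions.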
