PracHub

Design a News Aggregation System (Google News-style)

Last updated: Jul 1, 2026

Quick Overview

This question assesses a candidate's ability to design a large-scale news aggregation system, covering distributed crawling, article deduplication, event clustering, and real-time trending detection. It is a common system design interview topic used to evaluate architectural reasoning about data pipelines, freshness, consistency, and latency trade-offs at scale.

Company: Rippling

Role: Software Engineer

Category: System Design

Difficulty: medium

Interview Round: Technical Screen

Design a large-scale news aggregation service similar to Google News. The system continuously discovers and crawls news articles from tens of thousands of publishers across the web, groups articles that cover the same real-world event into a single **story**, surfaces **trending / hot topics** in near real time, and serves a ranked, deduplicated news feed to readers.

The interview focuses less on classic full-text search and more on the **ingestion and processing pipeline**: crawling at scale, coordinating a fleet of crawler machines and keeping it alive when nodes fail, deduplicating and clustering articles into stories, detecting hot topics, and choosing the right storage for each dataset, plus how those choices affect freshness, consistency, and latency.

### Constraints & Assumptions

- **Sources:** ~50,000–100,000 news sources / feeds (RSS, sitemaps, publisher APIs, and raw HTML).
- **Ingest volume:** ~2–5 million new articles per day (~30–60 articles/sec average; **5–10x bursts** during major breaking events).
- **Readers:** ~50M daily active users, each loading a feed a few times a day → tens of thousands of feed reads/sec at peak.
- **Freshness:** breaking news should appear in a reader's feed within ~2–5 minutes of publication.
- **Latency:** p95 feed read < ~200 ms.
- **Politeness:** respect `robots.txt` and per-domain crawl-rate limits.
- **Storage:** retain the article corpus for years (archive + serve recent feed).
- Assume we store article **metadata + a snippet/link** (publisher-hosted content), not full reproductions.

### Clarifying Questions to Ask

- Scope: global multi-language and multi-region, or a single language/region to start?
- Do we need **personalized** feeds (per-user ranking) or a single editorial/region feed in v1?
- Is full-text **search** in scope, or only the browse/feed + trending experience?
- Do we host article text or only link out with a snippet (this changes storage and copyright handling)?
- How fresh must **trending** be: sub-minute, or is a few minutes acceptable?
- Multimedia (images/video) in scope, or text articles only for now?

### Part 1 - Distributed crawling & fleet coordination

Design the crawl/ingestion subsystem that discovers and fetches new articles from tens of thousands of sources with low latency, while respecting per-domain politeness. The fleet is many worker machines that must divide the source list among themselves. Explain how work is **partitioned**, how you avoid two workers fetching the same source, how you schedule **recrawls**, and, critically, how the fleet **stays coordinated when the coordinator node or a worker dies** (leader election, membership, failure detection, rebalancing).

```hint Where to start
Separate "what to crawl and when" (a frontier/scheduler) from "who fetches it" (stateless workers pulling from a queue). If fetchers hold no durable ownership, adding or losing a worker doesn't reshuffle the whole assignment.
```

```hint Coordination
A single static "commander" is a single point of failure. Lean on a consensus / coordination service (ZooKeeper, etcd, or a Raft-based service) for **leader election, membership, and shard assignment**; heartbeats plus ephemeral/session nodes detect a dead worker so its shards get reassigned. Ad-hoc "round-robin across machines" doesn't solve "how do the machines know about each other."
```

```hint Partitioning
Consistent hashing of *domains* to workers keeps reassignment minimal when membership changes, and a per-domain token bucket enforces politeness regardless of which worker owns the domain.
```

#### Clarifying Questions for this Part

- Can we consume push/pull feeds (RSS, sitemaps, PubSubHubbub / WebSub) for most sources, or must we crawl raw HTML and discover article URLs ourselves?
- Is there an SLA difference between "tier-1" high-value sources (crawl every few seconds) and the long tail (crawl hourly)?
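
The Partitioning hint for Part 1 can be made concrete with a small sketch. The Python below is illustrative only: `HashRing`, `TokenBucket`, the virtual-node count, and the rate/capacity values are hypothetical names and numbers chosen for the example, not part of the question. It shows a consistent-hash ring that maps domains to workers, and a token bucket that takes the current time as an argument so its behavior is deterministic.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes, mapping domains to workers."""

    def __init__(self, workers, vnodes: int = 64):
        self.vnodes = vnodes          # assumed value; tune for balance
        self._points = []             # sorted list of (hash, worker)
        for worker in workers:
            self.add(worker)

    def add(self, worker: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self._points, (_hash(f"{worker}#{i}"), worker))

    def remove(self, worker: str) -> None:
        # Called when the coordination service reports the worker as dead.
        self._points = [p for p in self._points if p[1] != worker]

    def owner(self, domain: str) -> str:
        if not self._points:
            raise RuntimeError("no live workers")
        # First ring point at or clockwise after the domain's hash.
        i = bisect.bisect(self._points, (_hash(domain),))
        return self._points[i % len(self._points)][1]


class TokenBucket:
    """Per-domain politeness limiter: `rate` tokens/sec, burst of `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def try_acquire(self, now: float) -> bool:
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

When a worker is removed, only the domains it owned move to other workers; every other domain keeps its owner, which is the property the hint relies on. In a real fleet the bucket state would probably need to live with the domain (for example in the frontier or a shared store) so that politeness survives a reassignment; this sketch keeps it in process memory for brevity.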
### Part 2 - Processing: dedup, clustering into stories, and embeddings

Once raw articles arrive, design the pipeline that normalizes them, detects **near-duplicates** across syndicating sources, and clusters articles covering the same event into one **story**. The pipeline computes vector **embeddings** used for similarity. Address the trade-off the interviewer probed directly: should you **persist the article to the database first and then trigger embedding asynchronously**, or compute the embedding inline before the write? Reason about consistency, latency, durability, and accuracy.

```hint Clustering
Represent each article as an embedding (plus extracted entities/keywords). Attach a new article to an existing story by nearest-neighbor similarity over a **recent time window** using an ANN index, i.e., online/streaming clustering rather than a nightly batch re-cluster.
```

```hint Write-then-embed
Writing the article to the source-of-truth first, then emitting an event (transactional **outbox** or CDC) to an embedding worker, decouples ingest latency from model latency and guarantees durability of the raw article. The cost is a brief window where the article exists but isn't yet embedded/clustered (eventual consistency). Inline embedding gives immediate clustering but couples every ingest to model latency and failures.
```

### Part 3 - Trending / hot-topic detection

Design how the system detects **hot / trending** stories in near real time, e.g., a breaking event whose coverage and reader interest suddenly spike. Define what makes a topic "hot" and how you compute it at scale.

```hint Signals
"Hot" is a **burst relative to a baseline**, not raw volume. Combine the velocity/acceleration of article count per story with reader engagement (clicks, dwell) and **source diversity** (many independent outlets, not one spammer).
```

```hint Compute
Maintain per-cluster counters in a streaming aggregator with sliding/decaying windows (e.g., EWMA, or count-min sketches over the stream) and rank by a burst/z-score; avoid full table scans.
```

### Part 4 - Storage choices, data model, and serving

Choose storage for the major datasets: raw articles & metadata, story clusters, embeddings, engagement/trending counters, and the served feed. The interviewer rejected the answer "both a relational and a document DB would work" and pushed the candidate to pick one and justify it. Make **concrete per-dataset choices** and justify them by access pattern. Then sketch the **read/serving path** that builds a deduplicated, ranked feed with low latency, and note where full-text/vector **search** would fit if added.

```hint Match store to access pattern
Don't force one database on everything. Schemaless, high-write article corpus → a document/NoSQL store; embeddings → a vector/ANN index; hot trending counters and the feed → in-memory/KV; relational only where you genuinely need joins/transactions (e.g., user accounts). "Pick per dataset" is the right response to the interviewer's push.
```

```hint Serving
Keep the read path fast by precomputing/materializing ranked feeds (per region/segment) into a cache; the write path clusters and embeds asynchronously while reads serve cached, already-ranked lists.
```

### Follow-up Questions

- How do you guarantee exactly-once-ish processing from crawl to store given retries and worker crashes (idempotency keys, dedup, outbox)?
- A single major event multiplies article volume 100x within minutes. Where does the pipeline break first, and how do you protect it (backpressure, autoscaling, load shedding, priority queues)?
- How would you cluster the **same event across languages** to add multilingual support?
- How do you keep each reader's feed fresh without recomputing ranking on every request (incremental updates, cache invalidation when a new high-rank article lands)?
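
The first follow-up question mentions idempotency keys, dedup, and an outbox. One possible shape for that idea is sketched below in Python, under stated assumptions: `canonical_url`, `idempotency_key`, `ArticleStore`, and the `utm_` tracking-parameter prefix are hypothetical names and choices made for illustration, and the in-memory dictionary stands in for a real source-of-truth store. The point is that a retried or duplicated crawl of the same article resolves to the same key, so the write and its outbox event happen at most once.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: query parameters with these prefixes are tracking noise.
TRACKING_PREFIXES = ("utm_",)


def canonical_url(url: str) -> str:
    """Normalize a URL so trivially different forms map to one string."""
    parts = urlsplit(url.strip())
    query = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.lower().startswith(TRACKING_PREFIXES)
    )
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def idempotency_key(url: str) -> str:
    """Deterministic key derived from the canonical URL."""
    return hashlib.sha256(canonical_url(url).encode()).hexdigest()


class ArticleStore:
    """In-memory stand-in for the article store plus its outbox table."""

    def __init__(self):
        self.articles = {}   # idempotency key -> article record
        self.outbox = []     # events for downstream embedding/clustering

    def upsert(self, url: str, title: str) -> bool:
        """Insert once per key; return False if this is a repeat delivery."""
        key = idempotency_key(url)
        if key in self.articles:
            return False
        # In a real database these two writes would share one transaction,
        # so an article never exists without its outbox event or vice versa.
        self.articles[key] = {"url": canonical_url(url), "title": title}
        self.outbox.append({"type": "article_created", "key": key})
        return True
```

For example, `store.upsert("https://News.example.com/a/?utm_source=x", "T")` followed by `store.upsert("https://news.example.com/a", "T")` writes one article and one outbox event; the second call is recognized as a repeat. Downstream consumers would still need to tolerate redelivery of the outbox event itself, which is why the key travels with the event.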

Related Interview Questions

  • Design a User Behavior Tracking (Clickstream Analytics) System - Rippling (medium)
  • Prevent Duplicate Payments Under High Load - Rippling
  • Design a personalized news aggregator - Rippling (medium)
  • Design a Scalable News Feed - Rippling (medium)
  • Design Scalable Expense Violation Processing - Rippling (hard)

Jun 30, 2026, 12:00 AM
