PracHub

Design a news aggregator system

Last updated: Apr 23, 2026

Quick Overview

This question evaluates system design and distributed-systems competencies—specifically scalable ingestion, data modeling, deduplication, ranking, and low-latency feed serving—and is categorized under System Design.

Company: Rippling

Role: Software Engineer

Category: System Design

Difficulty: easy

Interview Round: Technical Screen

Hints

Part 1: what the answers unlock

The single most important answer is the read:write ratio. If reads dwarf writes and the same feed serves millions of anonymous users, that pushes you toward precomputing feeds on the write path, so the read path is a trivial "fetch a precomputed list + hydrate": you never rank at read time for global/topic feeds.

Part 2: where to start

There are two distinct concepts hiding in the voting rules. One is the current vote state for each (user, article) pair; this is what makes a repeat vote a no-op. The other is the user's ordered history of actions; this is what "last three votes" reads from. Model them separately.

Part 2: picking the structures

For the current state, you want a lookup keyed by the user-and-article pair so the no-op check is a single comparison against whatever is already there. For the history, you want something that preserves insertion order and makes "the most recent few" cheap to read. Decide what each structure is keyed by and what it stores before you write vote(...).

Part 2: edge cases to raise

Think about the read path and the boundaries: does asking for the history of a user who has never voted quietly create state, or stay read-only? What should "last three" return when the user has acted fewer than three times? And is keeping the full action log forever necessary, or could the storage be bounded if only a fixed number of recent votes is ever read?

Posted: Dec 5, 2025, 12:00 AM

Rippling Onsite — Two Parts

This Software Engineer onsite has two parts: a system design and a short coding follow-up. Both are below. Treat them as one session — the interviewer expects breadth on Part 1 and clean, correct code on Part 2.

Part 1 — System Design: News Aggregator

Design a news aggregator (similar to a "Top stories" / Google News–style product) that ingests articles from many publishers and serves ranked feeds to users.

Core requirements

  • Ingest articles from thousands of sources (RSS/Atom feeds, publisher APIs, webhooks).
  • Normalize & store article content and metadata: title, body/snippet, author, publish time, canonical URL, source, topics/tags.
  • De-duplicate near-identical stories across sources (the same event reported by many outlets).
  • Rank & serve feeds:
    • A homepage feed (global ranking).
    • A topic feed (e.g., Sports, Tech).
    • Optional: a personalized feed based on user interests.
  • Low-latency reads for feed browsing, with freshness that matters (new stories appear quickly).

Non-functional requirements (assume typical consumer scale)

  • High availability with multi-region read support.
  • Ability to handle traffic spikes during breaking news.
  • Reasonable content safety (basic spam / malicious-source handling).

What to cover

  1. APIs — both read-facing and ingestion-facing.
  2. Data model and storage choices.
  3. Ingestion + processing pipeline — parsing, enrichment, dedup.
  4. Ranking approach — signals, and batch vs. real-time.
  5. Caching and feed-generation strategy.
  6. Reliability, backfills, and monitoring.

You may state assumptions (traffic, QPS, data volume) as needed.
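One way to make the ingestion items above concrete is to derive a stable article key from the canonical URL, so that a feed item delivered twice (for example, once via RSS and once via webhook) lands on the same row. The sketch below is only an illustration under stated assumptions: `TRACKING_PARAMS`, `canonicalize_url`, `article_id`, and `ArticleStore` are hypothetical names, the tracking-parameter list is incomplete, and an in-memory dict stands in for whatever store you choose.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed (incomplete) set of query parameters that do not change the article.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign",
                   "utm_term", "utm_content", "fbclid", "gclid"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize_url(url: str) -> str:
    """Normalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    if parts.port and DEFAULT_PORTS.get(scheme) != parts.port:
        host = f"{host}:{parts.port}"
    path = parts.path.rstrip("/") or "/"
    # Drop tracking parameters and sort the rest for a stable order.
    kept = sorted((k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                  if k.lower() not in TRACKING_PARAMS)
    # The fragment is discarded: it never reaches the publisher's server.
    return urlunsplit((scheme, host, path, urlencode(kept), ""))


def article_id(url: str) -> str:
    """Hypothetical idempotency key: a short hash of the canonical URL."""
    return hashlib.sha256(canonicalize_url(url).encode("utf-8")).hexdigest()[:16]


class ArticleStore:
    """In-memory stand-in for the article table; upserts are idempotent."""

    def __init__(self) -> None:
        self._rows: dict[str, dict] = {}

    def upsert(self, url: str, title: str) -> bool:
        """Insert or overwrite; return True only if the article is new."""
        key = article_id(url)
        is_new = key not in self._rows
        self._rows[key] = {"url": canonicalize_url(url), "title": title}
        return is_new

    def __len__(self) -> int:
        return len(self._rows)
```

Because the key is derived from the content of the message rather than from delivery order, replaying a queue partition after a failure overwrites rows instead of duplicating them, which is the property an at-least-once pipeline needs.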

Clarifying Questions to Ask

Scope the problem before drawing boxes. Strong candidates ask:

  • Read:write ratio and scale — how many feed reads per second at peak vs. how many new articles ingested per day? (This drives whether you rank at read time or precompute.)
  • Freshness SLO — how fast must a newly published story appear in feeds: seconds, minutes, or "eventually"? Is breaking news tighter than the steady state?
  • Read-latency target — what p95/p99 is acceptable for loading a feed page?
  • Personalization — is the personalized feed in scope for v1, or are the home + topic feeds the priority? How much ML investment is expected?
  • Dedup tolerance — is it worse to over-merge (fuse two distinct events) or under-merge (show the same story twice)? This sets the clustering threshold bias.
  • Source trust — is the source set a curated allow-list, or open ingestion that needs spam/abuse defenses?

Constraints & Assumptions

State numbers out loud — the interviewer cares about the reasoning, not exact figures. A defensible anchor set (adjust as you reason):

  • Sources: ~tens of thousands of feeds/publishers (RSS/Atom, APIs, webhooks).
  • New articles: millions/day, bursty (5–10× spike on breaking news, time-zone skewed).
  • Read traffic: read-heavy by roughly three orders of magnitude — mostly anonymous home/topic feed requests.
  • Freshness SLO: new story visible within ~1–5 min (tighter for breaking news).
  • Read latency SLO: p95 in the low hundreds of milliseconds for a feed page.
  • Availability: reads must stay up regionally even when ingestion degrades (reads and writes should fail independently).
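The anchors above can be turned into rough numbers with a few lines of arithmetic. The figures below are assumptions chosen to fit the stated ranges (5 million articles/day, a 10× burst, a 1,000:1 read:write ratio, and about 10 KB stored per article); they are not data from the question, and `estimate` is a hypothetical helper.

```python
SECONDS_PER_DAY = 86_400


def estimate(articles_per_day: int = 5_000_000,
             burst_factor: int = 10,
             read_write_ratio: int = 1_000,
             bytes_per_article: int = 10_000) -> dict[str, float]:
    """Back-of-envelope sizing; every default is an assumed anchor."""
    avg_write_qps = articles_per_day / SECONDS_PER_DAY
    daily_bytes = articles_per_day * bytes_per_article
    return {
        "avg_write_qps": avg_write_qps,
        "peak_write_qps": avg_write_qps * burst_factor,
        "avg_read_qps": avg_write_qps * read_write_ratio,
        "storage_gb_per_day": daily_bytes / 1e9,
        "storage_tb_per_year": daily_bytes * 365 / 1e12,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name:>22}: {value:,.1f}")
```

With these inputs, writes average about 58 per second (about 579 at peak), reads average about 58,000 per second, and storage grows by about 50 GB per day, or roughly 18 TB per year. The useful conclusion is the asymmetry: ingestion is modest and can be asynchronous, while the read path has to be a cache or key-value lookup.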

What a Strong Answer Covers

The interviewer is checking for these dimensions (signals, not the answers themselves):

  • Capacity estimate — back-of-envelope QPS, article volume, and storage sizing that drives a design decision rather than sitting unused.
  • The precompute insight — recognizing that read:write skew + shared feeds means you materialize feeds on the write path.
  • Clean plane split — separating a high-throughput async ingestion/write plane from a low-latency read/serving plane, with a durable buffer between them.
  • Data model — distinct Article (one publisher's version) vs. Story/Cluster (the deduped event) vs. materialized feed rows; sensible storage choice per access pattern.
  • The dedup/clustering pipeline — candidate retrieval, a multi-signal match decision, and named failure modes (over-merge vs. under-merge).
  • Ranking — concrete signals, a freshness-decay treatment, and a batch-vs-streaming split.
  • The read path & caching — how a feed read stays cheap; what is and isn't CDN-cacheable.
  • Reliability — idempotency under at-least-once delivery, behavior during ingestion outages, burst absorption, backfills/reprocessing.
  • Monitoring & abuse — end-to-end freshness lag, dedup quality, serving metrics, and basic content-safety handling.
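To illustrate the precompute and ranking points together, here is a minimal sketch of a freshness-decayed score and a feed that is materialized as a list of story IDs on the write path. It assumes a `Story` record with a distinct-source count and a first-seen timestamp, a 6-hour half-life, and a log-damped coverage signal; all of these are illustrative choices, not a prescribed formula.

```python
import heapq
import math
from dataclasses import dataclass
from typing import Iterable, Optional

HALF_LIFE_HOURS = 6.0  # assumed: a story's score halves every 6 hours


@dataclass(frozen=True)
class Story:
    story_id: str
    topic: str
    source_count: int     # number of distinct outlets covering the event
    first_seen_ts: float  # Unix seconds


def score(story: Story, now_ts: float) -> float:
    """Coverage signal (log-damped) multiplied by exponential freshness decay."""
    age_hours = max(0.0, (now_ts - story.first_seen_ts) / 3600.0)
    decay = 0.5 ** (age_hours / HALF_LIFE_HOURS)
    return math.log1p(story.source_count) * decay


def materialize_feed(stories: Iterable[Story], now_ts: float,
                     topic: Optional[str] = None, limit: int = 50) -> list[str]:
    """Return the top story IDs for the home feed, or for one topic."""
    candidates = [s for s in stories if topic is None or s.topic == topic]
    top = heapq.nlargest(limit, candidates, key=lambda s: score(s, now_ts))
    return [s.story_id for s in top]
```

A feed builder would run `materialize_feed` whenever a cluster changes or on a short timer, write the resulting ID list to a cache keyed by feed name, and leave the read path to fetch the list and hydrate story cards.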

Follow-up Questions

Be ready for the interviewer to push deeper:

  • Breaking news: a single event suddenly arrives from 500 sources in 60 seconds — what saturates first (ingestion, dedup, ranking, or serving), and how does each plane absorb it?
  • Dedup at the edges: how do you handle "same headline, different event" (e.g. two earthquakes) versus an evolving story whose details change over hours?
  • Freshness vs. cost: if you lengthen the feed cache TTL to cut read cost, what breaks, and how do you keep breaking news fast anyway?
  • Consistency: two articles about a brand-new event arrive concurrently and each spawns its own cluster — how do you prevent or repair the duplicate?
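The dedup follow-ups are easier to discuss with a concrete match rule in hand. The sketch below combines two signals, title-token Jaccard similarity and a recency window, so that a repeated headline weeks later (the "two earthquakes" case) starts a new cluster while an evolving story keeps extending its window. `Clusterer`, the 0.6 threshold, and the 48-hour window are assumptions for illustration; the linear scan stands in for a real candidate-retrieval index, and the single-process code does not address the concurrent-creation race.

```python
import re
from dataclasses import dataclass, field


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a headline."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection over size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


@dataclass
class Cluster:
    cluster_id: int
    title_tokens: set[str]
    last_seen_ts: float
    article_ids: list[str] = field(default_factory=list)


class Clusterer:
    def __init__(self, sim_threshold: float = 0.6,
                 window_secs: float = 48 * 3600) -> None:
        self.sim_threshold = sim_threshold  # assumed merge threshold
        self.window_secs = window_secs      # assumed recency window
        self.clusters: list[Cluster] = []

    def assign(self, article_id: str, title: str, published_ts: float) -> int:
        """Attach the article to the best matching recent cluster, or open one."""
        toks = tokens(title)
        best, best_sim = None, 0.0
        for c in self.clusters:  # linear scan; a real system would index
            if abs(published_ts - c.last_seen_ts) > self.window_secs:
                continue  # too far apart in time to be the same event
            sim = jaccard(toks, c.title_tokens)
            if sim >= self.sim_threshold and sim > best_sim:
                best, best_sim = c, sim
        if best is None:
            best = Cluster(len(self.clusters), toks, published_ts)
            self.clusters.append(best)
        best.article_ids.append(article_id)
        # Sliding the window forward lets an evolving story keep attracting updates.
        best.last_seen_ts = max(best.last_seen_ts, published_ts)
        return best.cluster_id
```

Raising the threshold biases toward under-merging (duplicates shown) and lowering it biases toward over-merging (distinct events fused), which is the trade-off the dedup-tolerance clarifying question is meant to settle. For the concurrency follow-up, one common repair is a periodic pass that re-runs the same match rule between recent clusters and merges any pair that qualifies.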

Part 2 — Coding: Article Voting Tracker

As a follow-up, implement a small article voting tracker.

Requirements

  • Each user can upvote or downvote an article.
  • Re-casting the same vote on the same article counts only once (upvoting an article you already upvoted is a no-op).
  • Changing your vote on the same article (up → down, or down → up) counts as a separate action.
  • Support printing the last three votes of a given user.

For example, if you upvote the same article twice, that counts as a single vote; but if you upvote an article and then downvote it, that counts as two actions.

What a Strong Answer Covers

  • Correctly separating idempotent current state from the append-only action history.
  • A vote(...) operation that is O(1) and a last_three(...) read that is cheap and non-mutating.
  • Input validation and clear handling of the no-op vs. counted-action distinction.
  • Naming the edge cases (repeated flips, unknown user, bounded history, optional un-vote) and how the model would extend to a durable store.
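The points above can be sketched in a few dozen lines. This is one possible in-memory shape, not a reference solution: `VoteTracker`, the `Vote` enum, the most-recent-first return order, and the choice to bound history with `deque(maxlen=3)` are all assumptions, and un-voting is left out.

```python
from collections import deque
from enum import Enum
from typing import Deque, Dict, List, Tuple


class Vote(Enum):
    UP = "up"
    DOWN = "down"


class VoteTracker:
    HISTORY_SIZE = 3  # only the last three actions are ever read

    def __init__(self) -> None:
        # Current state: (user_id, article_id) -> latest vote.
        self._current: Dict[Tuple[str, str], Vote] = {}
        # Action history per user, bounded so memory stays O(users).
        self._history: Dict[str, Deque[Tuple[str, Vote]]] = {}

    def vote(self, user_id: str, article_id: str, vote: Vote) -> bool:
        """Record a vote in O(1); return False if it repeats the current vote."""
        if not isinstance(vote, Vote):
            raise ValueError(f"vote must be a Vote, got {vote!r}")
        key = (user_id, article_id)
        if self._current.get(key) is vote:
            return False  # same vote re-cast: no-op, not a new action
        self._current[key] = vote
        if user_id not in self._history:
            self._history[user_id] = deque(maxlen=self.HISTORY_SIZE)
        self._history[user_id].append((article_id, vote))
        return True

    def last_three(self, user_id: str) -> List[Tuple[str, Vote]]:
        """Most recent actions first; never creates state for unknown users."""
        return list(reversed(self._history.get(user_id, ())))


if __name__ == "__main__":
    tracker = VoteTracker()
    tracker.vote("u1", "a1", Vote.UP)
    tracker.vote("u1", "a1", Vote.UP)    # no-op
    tracker.vote("u1", "a1", Vote.DOWN)  # counted as a second action
    for article, v in tracker.last_three("u1"):
        print(article, v.value)
```

Using `dict.get` on the read path, rather than a `defaultdict`, is what keeps `last_three` non-mutating for users who have never voted. If the full audit log were required, the deque would become an unbounded list (or an append-only table in a durable store) and `last_three` would read its tail.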

© 2026 PracHub. All rights reserved.