What difficulty level is this interview question?

This is an easy-difficulty System Design question, commonly asked during Technical Screen rounds at Rippling.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Rippling during technical interviews.

Design a news aggregator system | Rippling Interview Question

Q: Design a news aggregator system

This question evaluates system design and distributed-systems competencies—specifically scalable ingestion, data modeling, deduplication, ranking, and low-latency feed serving—and is categorized under System Design.

Rippling Onsite — Two Parts

This Software Engineer onsite has two parts: a system design and a short coding follow-up. Both are below. Treat them as one session — the interviewer expects breadth on Part 1 and clean, correct code on Part 2.

Part 1 — System Design: News Aggregator

Design a news aggregator (similar to a "Top stories" / Google News–style product) that ingests articles from many publishers and serves ranked feeds to users.

Core requirements

Ingest articles from thousands of sources (RSS/Atom feeds, publisher APIs, webhooks).
Normalize & store article content and metadata: title, body/snippet, author, publish time, canonical URL, source, topics/tags.
De-duplicate near-identical stories across sources (the same event reported by many outlets).
Rank & serve feeds:
- A homepage feed (global ranking).
- A topic feed (e.g., Sports, Tech).
- Optional: a personalized feed based on user interests.
Low-latency reads for feed browsing, with freshness that matters (new stories appear quickly).
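
As a starting point for the "normalize & store" requirement, the metadata fields listed above could be captured in a small record type. This is a minimal sketch, not a prescribed schema: the class name `Article`, the field names, and the idea of deriving an ID by hashing the canonical URL are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Article:
    """One publisher's version of a story (hypothetical schema)."""
    title: str
    snippet: str
    author: str
    published_at: datetime
    canonical_url: str
    source: str
    topics: tuple[str, ...] = ()

    @property
    def article_id(self) -> str:
        # Assumption: the canonical URL identifies one publisher's article,
        # so re-ingesting the same URL yields the same ID (useful for
        # idempotent writes under at-least-once delivery).
        digest = hashlib.sha256(self.canonical_url.strip().encode("utf-8"))
        return digest.hexdigest()[:16]
```

Deriving the ID from the URL rather than generating it at ingest time means a retried or re-fetched item maps to the same key, which is one simple way to keep the write path idempotent.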

Non-functional requirements (assume typical consumer scale)

High availability with multi-region read support.
Ability to handle traffic spikes during breaking news.
Reasonable content safety (basic spam / malicious-source handling).

What to cover

APIs — both read-facing and ingestion-facing.
Data model and storage choices.
Ingestion + processing pipeline — parsing, enrichment, dedup.
Ranking approach — signals, and batch vs. real-time.
Caching and feed-generation strategy.
Reliability, backfills, and monitoring.

You may state assumptions (traffic, QPS, data volume) as needed.

Clarifying Questions to Ask

Scope the problem before drawing boxes. Strong candidates ask:

Read:write ratio and scale — how many feed reads per second at peak vs. how many new articles ingested per day? (This drives whether you rank at read time or precompute.)
Freshness SLO — how fast must a newly published story appear in feeds: seconds, minutes, or "eventually"? Is breaking news tighter than the steady state?
Read-latency target — what p95/p99 is acceptable for loading a feed page?
Personalization — is the personalized feed in scope for v1, or are the home + topic feeds the priority? How much ML investment is expected?
Dedup tolerance — is it worse to over-merge (fuse two distinct events) or under-merge (show the same story twice)? This sets the clustering threshold bias.
Source trust — is the source set a curated allow-list, or open ingestion that needs spam/abuse defenses?

Constraints & Assumptions

State numbers out loud — the interviewer cares about the reasoning, not exact figures. A defensible anchor set (adjust as you reason):

Sources: ~tens of thousands of feeds/publishers (RSS/Atom, APIs, webhooks).
New articles: millions/day, bursty (5–10× spike on breaking news, time-zone skewed).
Read traffic: read-heavy by roughly three orders of magnitude — mostly anonymous home/topic feed requests.
Freshness SLO: new story visible within ~1–5 min (tighter for breaking news).
Read latency SLO: p95 in the low hundreds of milliseconds for a feed page.
Availability: reads must stay up regionally even when ingestion degrades (reads and writes should fail independently).
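
The anchors above can be turned into a quick back-of-envelope calculation. The sketch below is illustrative only: the concrete inputs (2 million articles/day, 5 KB stored per article, a 1000:1 read-to-write ratio, a 10× burst) are assumptions picked to match the orders of magnitude listed, not figures given by the question.

```python
SECONDS_PER_DAY = 86_400


def capacity_estimate(articles_per_day, avg_article_bytes,
                      read_write_ratio, burst_factor):
    """Rough QPS and storage numbers from a handful of assumed inputs."""
    avg_write_qps = articles_per_day / SECONDS_PER_DAY
    return {
        "avg_write_qps": avg_write_qps,
        "peak_write_qps": avg_write_qps * burst_factor,
        # Assumption: "three orders of magnitude" means reads per article write.
        "avg_read_qps": avg_write_qps * read_write_ratio,
        "storage_tb_per_year": articles_per_day * avg_article_bytes * 365 / 1e12,
    }


# All four inputs are assumed values, not numbers from the problem statement.
estimate = capacity_estimate(
    articles_per_day=2_000_000,
    avg_article_bytes=5_000,
    read_write_ratio=1_000,
    burst_factor=10,
)
for name, value in estimate.items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions, writes land in the tens per second while reads land in the tens of thousands per second, which is the kind of gap that argues for precomputing feeds rather than ranking at read time.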

What a Strong Answer Covers

The interviewer is checking for these dimensions (signals, not the answers themselves):

Capacity estimate — back-of-envelope QPS, article volume, and storage sizing that drives a design decision rather than sitting unused.
The precompute insight — recognizing that read:write skew + shared feeds means you materialize feeds on the write path.
Clean plane split — separating a high-throughput async ingestion/write plane from a low-latency read/serving plane, with a durable buffer between them.
Data model — distinct Article (one publisher's version) vs. Story/Cluster (the deduped event) vs. materialized feed rows; sensible storage choice per access pattern.
The dedup/clustering pipeline — candidate retrieval, a multi-signal match decision, and named failure modes (over-merge vs. under-merge).
Ranking — concrete signals, a freshness-decay treatment, and a batch-vs-streaming split.
The read path & caching — how a feed read stays cheap; what is and isn't CDN-cacheable.
Reliability — idempotency under at-least-once delivery, behavior during ingestion outages, burst absorption, backfills/reprocessing.
Monitoring & abuse — end-to-end freshness lag, dedup quality, serving metrics, and basic content-safety handling.
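
To make the "concrete signals plus freshness decay" dimension tangible, here is one possible scoring sketch. It is not a known production formula: the signal names (`source_count`, `click_count`), the 0.5 weight, and the 6-hour half-life are hypothetical values chosen for illustration.

```python
import math


def story_score(source_count, click_count, age_hours, half_life_hours=6.0):
    """Hypothetical score: log-damped popularity times exponential decay."""
    # log1p damps large counts so one viral story does not dominate forever.
    popularity = math.log1p(source_count) + 0.5 * math.log1p(click_count)
    # The score halves every `half_life_hours` of story age.
    decay = 0.5 ** (age_hours / half_life_hours)
    return popularity * decay


# A fresh story with few sources vs. a day-old story with many sources.
fresh = story_score(source_count=3, click_count=0, age_hours=1)
stale = story_score(source_count=50, click_count=0, age_hours=24)
print(fresh > stale)
```

A split that fits this shape: slow-moving inputs such as source quality can be refreshed in batch, while the decay term and coverage counts are cheap enough to recompute in a streaming job as new articles join a cluster.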

Follow-up Questions

Be ready for the interviewer to push deeper:

Breaking news: a single event suddenly arrives from 500 sources in 60 seconds — what saturates first (ingestion, dedup, ranking, or serving), and how does each plane absorb it?
Dedup at the edges: how do you handle "same headline, different event" (e.g. two earthquakes) versus an evolving story whose details change over hours?
Freshness vs. cost: if you lengthen the feed cache TTL to cut read cost, what breaks, and how do you keep breaking news fast anyway?
Consistency: two articles about a brand-new event arrive concurrently and each spawns its own cluster — how do you prevent or repair the duplicate?
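
For the "same headline, different event" follow-up, one direction worth sketching is a match decision that combines text similarity with a publish-time gate. The code below is a simplified illustration under stated assumptions: word-shingle Jaccard similarity on titles stands in for whatever similarity signal a real system would use, and the 0.5 threshold and 48-hour window are arbitrary placeholder values.

```python
from datetime import datetime, timedelta


def word_shingles(text, k=2):
    """Set of k-word shingles from lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def same_story(title_a, published_a, title_b, published_b,
               threshold=0.5, window=timedelta(hours=48)):
    """Two-signal match: titles must be similar AND close in time."""
    if abs(published_a - published_b) > window:
        return False  # identical headlines months apart are treated as distinct
    similarity = jaccard(word_shingles(title_a), word_shingles(title_b))
    return similarity >= threshold


t0 = datetime(2024, 1, 1, 12, 0)
headline = "Earthquake strikes northern Chile"
print(same_story(headline, t0, headline, t0 + timedelta(hours=3)))
print(same_story(headline, t0, headline, t0 + timedelta(days=60)))
```

A fixed window measured against the first article would under-merge a story that evolves over days, so a candidate would likely compare against the cluster's most recent member and add further signals (entities, location) before merging.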

Part 2 — Coding: Article Voting Tracker

As a follow-up, implement a small article voting tracker.

Requirements

Each user can upvote or downvote an article.
Re-casting the same vote on the same article counts only once (upvoting an article you already upvoted is a no-op).
Changing your vote on the same article (up → down, or down → up) counts as a separate action.
Support printing the last three votes of a given user.

For example, if you upvote the same article twice, that counts as a single vote; but if you upvote an article and then downvote it, that counts as two actions.

What a Strong Answer Covers

Correctly separating idempotent current state from the append-only action history.
A vote(...) operation that is $O(1)$ and a last_three(...) read that is cheap and non-mutating.
Input validation and clear handling of the no-op vs. counted-action distinction.
Naming the edge cases (repeated flips, unknown user, bounded history, optional un-vote) and how the model would extend to a durable store.
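
One way the points above could fit together is sketched below. It is a minimal in-memory illustration, not a reference solution: the class name `VoteTracker`, the `"up"`/`"down"` string values, and the choice to return history oldest-first are assumptions, and the bounded `deque` keeps only the last three actions per user rather than a full append-only log.

```python
from collections import defaultdict, deque

UP, DOWN = "up", "down"
HISTORY_SIZE = 3


class VoteTracker:
    def __init__(self):
        # Current state: (user_id, article_id) -> "up" | "down"
        self._current = {}
        # Per-user action history, bounded to the last HISTORY_SIZE actions.
        self._history = defaultdict(lambda: deque(maxlen=HISTORY_SIZE))

    def vote(self, user_id, article_id, direction):
        """Record a vote. Returns True if it counted, False if it was a no-op."""
        if direction not in (UP, DOWN):
            raise ValueError(f"direction must be {UP!r} or {DOWN!r}")
        key = (user_id, article_id)
        if self._current.get(key) == direction:
            return False  # re-casting the same vote counts only once
        self._current[key] = direction
        self._history[user_id].append((article_id, direction))
        return True

    def last_three(self, user_id):
        """Last three counted actions for a user, oldest first."""
        # .get avoids creating an entry for an unknown user (non-mutating read).
        history = self._history.get(user_id)
        return list(history) if history else []


tracker = VoteTracker()
tracker.vote("u1", "a1", UP)
tracker.vote("u1", "a1", UP)    # no-op: same vote again
tracker.vote("u1", "a1", DOWN)  # counted: the vote changed
print(tracker.last_three("u1"))
```

Both operations are constant time here: `vote` does one dictionary lookup and one bounded append, and `last_three` copies at most three items. Moving to a durable store would likely mean keeping the full action log and a separate current-state table keyed by user and article.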
