Design a news aggregator system
Company: Rippling
Role: Software Engineer
Category: System Design
Difficulty: easy
Interview Round: Technical Screen
## Rippling Onsite — Two Parts
This Software Engineer onsite has **two parts**: a system design and a short coding follow-up. Both are below. Treat them as one session — the interviewer expects breadth on Part 1 and clean, correct code on Part 2.
---
# Part 1 — System Design: News Aggregator
Design a **news aggregator** (similar to a "Top stories" / Google News–style product) that ingests articles from many publishers and serves ranked feeds to users.
### Core requirements
- **Ingest** articles from thousands of sources (RSS/Atom feeds, publisher APIs, webhooks).
- **Normalize & store** article content and metadata: title, body/snippet, author, publish time, canonical URL, source, topics/tags.
- **De-duplicate** near-identical stories across sources (the same event reported by many outlets).
- **Rank & serve** feeds:
  - A **homepage feed** (global ranking).
  - A **topic feed** (e.g., Sports, Tech).
  - Optional: a **personalized** feed based on user interests.
- **Low-latency reads** for feed browsing, plus **freshness**: new stories should appear in feeds quickly.
### Non-functional requirements (assume typical consumer scale)
- High availability with multi-region read support.
- Ability to handle traffic spikes during breaking news.
- Reasonable content safety (basic spam / malicious-source handling).
### What to cover
1. **APIs** — both read-facing and ingestion-facing.
2. **Data model and storage** choices.
3. **Ingestion + processing pipeline** — parsing, enrichment, dedup.
4. **Ranking approach** — signals, and batch vs. real-time.
5. **Caching and feed-generation** strategy.
6. **Reliability, backfills, and monitoring.**
You may state assumptions (traffic, QPS, data volume) as needed.
```hint What the answers unlock
The single most important answer is the **read:write ratio**. If reads dwarf writes and the *same* feed serves millions of anonymous users, that pushes you toward **precomputing feeds on the write path** so the read path is a trivial "fetch a precomputed list + hydrate" — you never rank at read time for global/topic feeds.
```
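
The "fetch a precomputed list + hydrate" read path from the hint can be sketched as follows. This is a minimal in-memory illustration, not a prescribed design: `FeedStore`, `StoryCard`, `materialize`, and `read_feed` are hypothetical names, and the two dicts stand in for whatever feed store and story-metadata store you would actually choose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoryCard:
    """Display metadata for one deduped story (hypothetical shape)."""
    story_id: str
    title: str
    url: str


class FeedStore:
    """In-memory stand-in for a materialized-feed store plus a story store."""

    def __init__(self) -> None:
        self._feeds: dict[str, list[str]] = {}     # feed key -> ranked story ids
        self._stories: dict[str, StoryCard] = {}   # story id -> display metadata

    def put_story(self, card: StoryCard) -> None:
        self._stories[card.story_id] = card

    def materialize(self, feed_key: str, ranked_ids: list[str]) -> None:
        # Write path: the ranking pipeline overwrites the precomputed list.
        self._feeds[feed_key] = list(ranked_ids)

    def read_feed(self, feed_key: str, limit: int = 20, offset: int = 0) -> list[StoryCard]:
        # Read path: slice the precomputed list, then hydrate ids into cards.
        # No ranking happens here; ids with no stored card are skipped.
        ids = self._feeds.get(feed_key, [])[offset:offset + limit]
        return [self._stories[i] for i in ids if i in self._stories]


store = FeedStore()
store.put_story(StoryCard("s1", "Story A", "https://example.com/a"))
store.put_story(StoryCard("s2", "Story B", "https://example.com/b"))
store.materialize("home", ["s2", "s1", "s-missing"])
home_titles = [card.title for card in store.read_feed("home")]
```

Because every anonymous user reading `"home"` gets the same list, the result of `read_feed` is also a natural candidate for a cache entry keyed by feed key and page.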
### Clarifying Questions to Ask
Scope the problem before drawing boxes. Strong candidates ask:
- **Read:write ratio and scale** — how many feed reads per second at peak vs. how many new articles ingested per day? (This drives whether you rank at read time or precompute.)
- **Freshness SLO** — how fast must a newly published story appear in feeds: seconds, minutes, or "eventually"? Is breaking news tighter than the steady state?
- **Read-latency target** — what p95/p99 is acceptable for loading a feed page?
- **Personalization** — is the personalized feed in scope for v1, or are the home + topic feeds the priority? How much ML investment is expected?
- **Dedup tolerance** — is it worse to over-merge (fuse two distinct events) or under-merge (show the same story twice)? This sets the clustering threshold bias.
- **Source trust** — is the source set a curated allow-list, or open ingestion that needs spam/abuse defenses?
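
To make the dedup-tolerance question concrete, here is a small sketch of one possible similarity signal: Jaccard overlap of word shingles on titles. This is an assumed technique chosen for illustration (the problem statement does not prescribe one), and `shingles`, `jaccard`, `same_story`, and the threshold values are all hypothetical. A real pipeline would likely combine several signals.

```python
def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles of a lowercased, whitespace-split text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def same_story(title_a: str, title_b: str, threshold: float = 0.5) -> bool:
    # Lower threshold -> more merging (risk: over-merge distinct events).
    # Higher threshold -> less merging (risk: show the same story twice).
    return jaccard(shingles(title_a), shingles(title_b)) >= threshold


a = "Magnitude 6.1 earthquake strikes northern Chile"
b = "Magnitude 6.1 earthquake strikes northern Chile overnight"
c = "Central bank raises interest rates again"

similarity_ab = jaccard(shingles(a), shingles(b))
similarity_ac = jaccard(shingles(a), shingles(c))
```

With these two titles, a permissive threshold merges `a` and `b` while a strict one keeps them apart, which is exactly the bias the interviewer's answer would set.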
### Constraints & Assumptions
State numbers out loud — the interviewer cares about the reasoning, not exact figures. A defensible anchor set (adjust as you reason):
- **Sources:** ~tens of thousands of feeds/publishers (RSS/Atom, APIs, webhooks).
- **New articles:** millions/day, **bursty** (5–10× spike on breaking news, time-zone skewed).
- **Read traffic:** reads outnumber writes by roughly three orders of magnitude (about 1,000:1), and most are anonymous home/topic feed requests.
- **Freshness SLO:** new story visible within ~1–5 min (tighter for breaking news).
- **Read latency SLO:** p95 in the low hundreds of milliseconds for a feed page.
- **Availability:** reads must stay up regionally even when ingestion degrades (reads and writes should fail independently).
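
A back-of-envelope pass over these anchors might look like the sketch below. Every number is an assumption picked from within the ranges above (5 million articles/day, a 10× burst, a 1,000:1 read-to-write ratio) plus one figure not in the list at all (an average stored article size of 5 KB); swap in your own values.

```python
SECONDS_PER_DAY = 86_400

# Assumed inputs, not given figures.
articles_per_day = 5_000_000   # "millions/day"
burst_multiplier = 10          # upper end of the 5-10x breaking-news spike
read_write_ratio = 1_000       # "three orders of magnitude"
avg_article_bytes = 5_000      # assumed average stored size per article

avg_write_qps = articles_per_day / SECONDS_PER_DAY
peak_write_qps = avg_write_qps * burst_multiplier
avg_read_qps = avg_write_qps * read_write_ratio
daily_storage_gb = articles_per_day * avg_article_bytes / 1e9

print(f"avg writes/s  ~ {avg_write_qps:,.0f}")
print(f"peak writes/s ~ {peak_write_qps:,.0f}")
print(f"avg reads/s   ~ {avg_read_qps:,.0f}")
print(f"storage/day   ~ {daily_storage_gb:,.0f} GB")
```

Under these assumptions, writes are tens per second on average while reads are tens of thousands per second, which is the gap that justifies doing ranking work once on the write path instead of on every read.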
### What a Strong Answer Covers
The interviewer is checking for these dimensions (signals, not the answers themselves):
- **Capacity estimate** — back-of-envelope QPS, article volume, and storage sizing that *drives* a design decision rather than sitting unused.
- **The precompute insight** — recognizing that read:write skew + shared feeds means you materialize feeds on the write path.
- **Clean plane split** — separating a high-throughput async ingestion/write plane from a low-latency read/serving plane, with a durable buffer between them.
- **Data model** — distinct Article (one publisher's version) vs. Story/Cluster (the deduped event) vs. materialized feed rows; sensible storage choice per access pattern.
- **The dedup/clustering pipeline** — candidate retrieval, a multi-signal match decision, and named failure modes (over-merge vs. under-merge).
- **Ranking** — concrete signals, a freshness-decay treatment, and a batch-vs-streaming split.
- **The read path & caching** — how a feed read stays cheap; what is and isn't CDN-cacheable.
- **Reliability** — idempotency under at-least-once delivery, behavior during ingestion outages, burst absorption, backfills/reprocessing.
- **Monitoring & abuse** — end-to-end freshness lag, dedup quality, serving metrics, and basic content-safety handling.
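
As one illustration of "concrete signals plus a freshness-decay treatment", the sketch below combines coverage and source trust with an exponential half-life decay. The formula, the signal names (`cluster_size`, `source_trust`), and the 6-hour half-life are assumptions for discussion, not a known production ranking function.

```python
import math


def story_score(
    cluster_size: int,
    source_trust: float,
    age_hours: float,
    half_life_hours: float = 6.0,
) -> float:
    """Score a story cluster: coverage and trust, decayed by age.

    cluster_size    -- number of articles merged into the story (assumed signal)
    source_trust    -- weight for the best source in the cluster (assumed signal)
    age_hours       -- hours since the story first appeared
    half_life_hours -- the score halves every this many hours (assumed tunable)
    """
    if age_hours < 0:
        raise ValueError("age_hours must be non-negative")
    coverage = math.log1p(cluster_size)               # diminishing returns on coverage
    decay = 0.5 ** (age_hours / half_life_hours)      # exponential freshness decay
    return source_trust * coverage * decay


fresh_small = story_score(cluster_size=3, source_trust=1.0, age_hours=0.0)
stale_big = story_score(cluster_size=50, source_trust=1.0, age_hours=24.0)
```

Here a brand-new story with three articles outscores a day-old story with fifty, because four half-lives have elapsed for the older one. In a batch-vs-streaming split, the slow-moving inputs (such as source trust) could be refreshed in batch while age and cluster size update in near real time.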
### Follow-up Questions
Be ready for the interviewer to push deeper:
- **Breaking news:** a single event suddenly arrives from 500 sources in 60 seconds — what saturates first (ingestion, dedup, ranking, or serving), and how does each plane absorb it?
- **Dedup at the edges:** how do you handle "same headline, different event" (e.g. two earthquakes) versus an evolving story whose details change over hours?
- **Freshness vs. cost:** if you lengthen the feed cache TTL to cut read cost, what breaks, and how do you keep breaking news fast anyway?
- **Consistency:** two articles about a brand-new event arrive concurrently and each spawns its own cluster — how do you prevent or repair the duplicate?
---
# Part 2 — Coding: Article Voting Tracker
As a follow-up, implement a small **article voting tracker**.
### Requirements
- Each user can **upvote** or **downvote** an article.
- Re-casting the **same** vote on the same article counts only **once** (upvoting an article you already upvoted is a no-op).
- **Changing** your vote on the same article (up → down, or down → up) counts as a **separate action**.
- Support printing the **last three votes** of a given user.
For example, if you upvote the same article twice, that counts as a single vote; but if you upvote an article and then downvote it, that counts as two actions.
```hint Where to start
There are **two distinct concepts** hiding in these rules. One is the *current* vote state for each `(user, article)` pair — this is what makes a repeat vote a no-op. The other is the user's *ordered history of actions* — this is what "last three votes" reads from. Model them separately.
```
```hint Picking the structures
For the *current state*, you want a lookup keyed by the user-and-article pair so the no-op check is a single comparison against whatever's already there. For the *history*, you want something that preserves insertion order and makes "the most recent few" cheap to read. Decide what each structure is keyed by and what it stores before you write `vote(...)`.
```
```hint Edge cases to raise
Think about the read path and the boundaries: does asking for the history of a user who has never voted quietly create state, or stay read-only? What should "last three" return when the user has acted fewer than three times? And is keeping the *full* action log forever necessary, or could the storage be bounded if only a fixed number of recent votes is ever read?
```
### What a Strong Answer Covers
- Correctly separating **idempotent current state** from the **append-only action history**.
- A `vote(...)` operation that is $O(1)$ and a `last_three(...)` read that is cheap and **non-mutating**.
- Input validation and clear handling of the no-op vs. counted-action distinction.
- Naming the edge cases (repeated flips, unknown user, bounded history, optional un-vote) and how the model would extend to a durable store.
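
One way the points above could come together is sketched below. It is a minimal in-memory version under stated assumptions: the class and enum names (`VoteTracker`, `Vote`) are hypothetical, "last three" is returned most-recent-first (the prompt does not specify an order), and history is bounded to three entries per user since nothing older is ever read.

```python
from collections import deque
from enum import Enum


class Vote(Enum):
    UP = "up"
    DOWN = "down"


class VoteTracker:
    HISTORY_SIZE = 3  # only the last three actions are ever read

    def __init__(self) -> None:
        # Current state: (user_id, article_id) -> latest vote. Drives the no-op check.
        self._current: dict[tuple[str, str], Vote] = {}
        # Action history: user_id -> bounded deque of (article_id, vote), oldest first.
        self._history: dict[str, deque] = {}

    def vote(self, user_id: str, article_id: str, vote: Vote) -> bool:
        """Record a vote. Returns True if it counted, False if it was a no-op."""
        if not isinstance(vote, Vote):
            raise ValueError("vote must be Vote.UP or Vote.DOWN")
        key = (user_id, article_id)
        if self._current.get(key) is vote:
            return False  # same vote re-cast: no state change, no history entry
        self._current[key] = vote
        history = self._history.setdefault(user_id, deque(maxlen=self.HISTORY_SIZE))
        history.append((article_id, vote))  # deque drops the oldest entry when full
        return True

    def last_three(self, user_id: str) -> list[tuple[str, Vote]]:
        """Return up to three most recent counted actions, newest first."""
        # .get keeps this read-only: unknown users do not create state.
        return list(reversed(self._history.get(user_id, ())))


tracker = VoteTracker()
first = tracker.vote("alice", "a1", Vote.UP)      # counted
repeat = tracker.vote("alice", "a1", Vote.UP)     # no-op
flipped = tracker.vote("alice", "a1", Vote.DOWN)  # counted as a separate action
tracker.vote("alice", "a2", Vote.UP)
tracker.vote("alice", "a3", Vote.UP)
recent = tracker.last_three("alice")
```

Both operations are $O(1)$ because the history is capped at a constant size. If the full audit log were needed later, the bounded deque could be swapped for an append-only list or a durable table, with the current-state map unchanged.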
Quick Answer: This question evaluates system design and distributed-systems competencies—specifically scalable ingestion, data modeling, deduplication, ranking, and low-latency feed serving—and is categorized under System Design.