What difficulty level is this interview question?

This is a medium difficulty System Design question, commonly asked during Onsite rounds at Confluent.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Confluent during technical interviews.

Design an RSS News Feed Service | Confluent Interview Question

Q: Design an RSS News Feed Service

This question evaluates a candidate's ability to design a scalable RSS news feed service, emphasizing system architecture, relational schema and API design, including database table and index planning, ingestion and serving paths, deduplication/idempotency, and feed generation trade-offs.

Design an RSS news feed service.

The service lets users subscribe to RSS sources, ingests articles from those sources on a schedule, stores article metadata, and serves each user a personalized feed of articles drawn from the sources they follow. You do not need to draw diagrams, but you must clearly explain the main components, the storage model, and the request flows.

This interview emphasizes two areas above all others: database table design and API design. Spend the bulk of your time there. Be concrete — name the tables, their columns, primary keys, the indexes that make the hot queries fast, and the endpoints with their request/response shapes.

Constraints & Assumptions

Single logical service; you may assume a relational primary store (e.g. PostgreSQL) with the option to offload feed items to a key-value/wide-column store at scale.
Eventual consistency is acceptable: a newly published article need not appear in feeds instantly — seconds-to-minutes of lag is fine.
Sources number in the hundreds of thousands; a popular source may be followed by a large fraction of users (fanout hot spot).
Feed reads dominate traffic and must be low-latency; ingestion is bursty (some sources publish many items at once).
RSS metadata is unreliable: guid may be missing, published_at may be absent or wrong, URLs may change, and the same article may be syndicated by multiple sources.
You may assume an external auth layer supplies an authenticated user_id.
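
The unreliable-metadata constraint can be made concrete with a dedup-key sketch. This is a minimal illustration, not a prescribed method: the function names `canonicalize_url` and `dedup_key`, the tracking-parameter list, and the fallback order (GUID, then canonical URL, then title) are all assumptions introduced here.

```python
import hashlib
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical set of tracking parameters to drop; a real deployment would tune this.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign",
                   "utm_term", "utm_content", "fbclid"}

def canonicalize_url(url: str) -> str:
    """Normalize a URL so trivially different links compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep an explicit port only when it is not the scheme default.
    if parts.port and (scheme, parts.port) not in {("http", 80), ("https", 443)}:
        host = f"{host}:{parts.port}"
    path = parts.path.rstrip("/") or "/"
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in TRACKING_PARAMS]
    query = urlencode(sorted(kept))
    return urlunsplit((scheme, host, path, query, ""))  # fragment dropped

def dedup_key(source_id: int, guid: Optional[str], link: Optional[str], title: str) -> str:
    """Pick the most trustworthy identifier available and hash it."""
    if guid and guid.strip():
        # GUIDs are only assumed unique within one feed, so scope them by source.
        basis = f"guid:{source_id}:{guid.strip()}"
    elif link:
        # Not scoped by source, so syndicated copies with the same URL collide.
        basis = f"url:{canonicalize_url(link)}"
    else:
        basis = f"title:{source_id}:{title.strip().lower()}"
    return hashlib.sha256(basis.encode("utf-8")).hexdigest()
```

Scoping the GUID by source but leaving the URL key global is one possible choice: it lets cross-source syndication of the same link collapse to one key, at the cost of missing syndicated copies that carry their own GUIDs.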

Clarifying Questions to Ask

What is the read:write ratio, and roughly how many sources does a typical user subscribe to (tens? thousands)? This drives fanout-on-read vs. fanout-on-write.
How fresh must feeds be — is minutes-of-lag acceptable, or is near-real-time required?
Do we need full-text article content and search, or only metadata (title, summary, link)?
What per-user state must we track (read / saved / hidden), and is read-state required to be strongly consistent?
Are there a few "celebrity" sources with millions of subscribers we must plan a fanout hot spot around?
Do we control the article publishers, or are these arbitrary third-party RSS feeds (affecting fetch politeness, rate limits, and trust)?

What a Strong Answer Covers

Requirements split into functional (subscribe, ingest, dedup, serve paginated feed, per-user state) and non-functional (low-latency reads, idempotent ingestion, eventual consistency, source-failure isolation).
A concrete relational schema: every core table with columns, primary keys, and the specific indexes that back each hot query (subscription lookup, fanout to subscribers, feed read).
Idempotent, deduplicated ingestion: conditional fetches (ETag / If-Modified-Since), GUID/canonical-URL dedup, upsert-on-conflict, and a clear decoupling of fetch → parse → store stages.
A clear API surface with REST endpoints, request/response bodies, and cursor-based pagination on the feed endpoint.
An explicit feed-generation decision: fanout-on-read vs. fanout-on-write, the trade-offs, and a defended default (typically a hybrid).
Scaling and failure handling: per-domain fetch rate limiting, exponential backoff and dead-lettering for broken feeds, partitioning strategy, caching of hot feed pages, and source-level failure isolation.
Edge cases: missing GUIDs, missing/incorrect published_at, cross-source syndication, post-publication article edits, and unsubscribe-after-materialization cleanup.
Observability: the ingestion and serving metrics you'd track (fetch latency, parse-failure rate, duplicate rate, queue lag, feed-read p99).
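
To show the level of concreteness the schema, upsert, and pagination points ask for, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL. Every table, column, and index name is an assumption made for illustration, as are the helpers `upsert_article` and `read_feed`. The feed query is the fanout-on-read variant, and `dedup_key` is assumed to be computed upstream from the GUID or canonical URL. The upsert syntax needs SQLite 3.24 or newer.

```python
import sqlite3

SCHEMA = """
CREATE TABLE sources (
    source_id     INTEGER PRIMARY KEY,
    feed_url      TEXT NOT NULL UNIQUE,
    etag          TEXT,              -- saved for conditional fetches
    last_modified TEXT
);
CREATE TABLE subscriptions (
    user_id   INTEGER NOT NULL,
    source_id INTEGER NOT NULL REFERENCES sources(source_id),
    PRIMARY KEY (user_id, source_id)             -- subscription lookup by user
);
CREATE INDEX idx_subs_by_source ON subscriptions (source_id, user_id);  -- fanout
CREATE TABLE articles (
    article_id   INTEGER PRIMARY KEY,
    source_id    INTEGER NOT NULL REFERENCES sources(source_id),
    dedup_key    TEXT NOT NULL,
    title        TEXT NOT NULL,
    link         TEXT NOT NULL,
    published_at INTEGER NOT NULL,   -- epoch seconds; use fetch time if absent
    UNIQUE (source_id, dedup_key)    -- makes re-ingestion idempotent
);
CREATE INDEX idx_articles_feed
    ON articles (source_id, published_at DESC, article_id DESC);  -- feed read
"""

def upsert_article(db, source_id, dedup_key, title, link, published_at):
    """Insert an article, or refresh title/link if it was already ingested."""
    db.execute(
        """INSERT INTO articles (source_id, dedup_key, title, link, published_at)
           VALUES (?, ?, ?, ?, ?)
           ON CONFLICT (source_id, dedup_key)
           DO UPDATE SET title = excluded.title, link = excluded.link""",
        (source_id, dedup_key, title, link, published_at),
    )

def read_feed(db, user_id, limit=20, cursor=None):
    """Return (rows, next_cursor); cursor is a (published_at, article_id) pair."""
    sql = """SELECT a.article_id, a.title, a.published_at
             FROM articles a JOIN subscriptions s ON s.source_id = a.source_id
             WHERE s.user_id = ?"""
    params = [user_id]
    if cursor is not None:
        # Keyset condition: strictly older than the last row of the previous page.
        sql += " AND (a.published_at < ? OR (a.published_at = ? AND a.article_id < ?))"
        params += [cursor[0], cursor[0], cursor[1]]
    sql += " ORDER BY a.published_at DESC, a.article_id DESC LIMIT ?"
    params.append(limit)
    rows = db.execute(sql, params).fetchall()
    # A full page may still be the last one; the following read then returns [].
    next_cursor = (rows[-1][2], rows[-1][0]) if len(rows) == limit else None
    return rows, next_cursor

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
db.execute("INSERT INTO sources (source_id, feed_url) VALUES (1, 'https://example.com/rss')")
db.execute("INSERT INTO subscriptions VALUES (42, 1)")
for i in range(5):
    upsert_article(db, 1, f"k{i}", f"Article {i}", f"https://example.com/{i}", 1000 + i)
# Re-ingesting the same dedup_key updates the row instead of adding a new one.
upsert_article(db, 1, "k0", "Article 0 (edited)", "https://example.com/0", 1000)
page1, cur = read_feed(db, 42, limit=2)
page2, cur2 = read_feed(db, 42, limit=2, cursor=cur)
```

The cursor pairs `published_at` with `article_id` so that articles sharing a timestamp still have a stable order across pages, which offset-based pagination does not give when new articles arrive between requests. Note that the per-source unique constraint does not catch cross-source syndication; that would need a separate key.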

Follow-up Questions

A single source suddenly gains millions of subscribers and publishes a burst of articles. How does your fanout-on-write path avoid melting? (Consider lazy/hybrid fanout, batching, and treating celebrity sources as pull-only.)
A user unsubscribes from a source they were already materialized into under fanout-on-write. How do you keep stale articles out of their feed without an expensive synchronous delete?
Two different sources syndicate the identical article. How do you detect the duplicate at ingestion time, and how do you decide which copy a user sees?
How would you add full-text search over article titles and summaries, and where does that index live relative to the primary store?
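
One possible direction for the first two follow-ups is a hybrid read path: ordinary sources are pushed into a per-user materialized list, very large sources are pulled at read time, and the current subscription set is applied as a filter on every read. The sketch below is in-memory and illustrative only; `Item`, `hybrid_feed`, and the data shapes are hypothetical names, not part of the question.

```python
import heapq
from dataclasses import dataclass
from itertools import islice

@dataclass(frozen=True)
class Item:
    published_at: int
    article_id: int
    source_id: int

def hybrid_feed(materialized, pull_lists, subscribed, limit):
    """Merge pushed and pulled items newest-first, skipping unsubscribed sources.

    materialized: items already fanned out to this user, newest first.
    pull_lists:   dict of source_id -> recent items for pull-only sources,
                  each list newest first.
    subscribed:   the user's current subscription set, checked at read time.
    """
    streams = [materialized] + [
        items for sid, items in pull_lists.items() if sid in subscribed
    ]
    # heapq.merge with reverse=True expects each input sorted largest-first.
    merged = heapq.merge(
        *streams, key=lambda it: (it.published_at, it.article_id), reverse=True
    )
    # Read-time filter: stale rows from unsubscribed sources never reach the user.
    live = (it for it in merged if it.source_id in subscribed)
    return list(islice(live, limit))

materialized = [Item(105, 5, 1), Item(103, 3, 2), Item(101, 1, 1)]
pull_lists = {9: [Item(104, 4, 9), Item(102, 2, 9)]}  # source 9 is pull-only
subscribed = {1, 9}  # the user has since unsubscribed from source 2
feed = hybrid_feed(materialized, pull_lists, subscribed, limit=4)
```

Because the filter runs on read, the unsubscribe request itself only has to delete the subscription row. The leftover materialized rows for source 2 could then be removed by a background job, under the assumption that a small amount of wasted storage is acceptable in the meantime.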
