Design an RSS News Feed Service

Last updated: Jun 17, 2026

Quick Overview

This question evaluates a candidate's ability to design a scalable RSS news feed service. It emphasizes relational schema design (tables and indexes) and API design, and also covers overall system architecture, the ingestion and serving paths, deduplication and idempotency, and feed-generation trade-offs.

Company: Confluent

Role: Software Engineer

Category: System Design

Difficulty: medium

Interview Round: Onsite

Design an **RSS news feed service**. The service lets users subscribe to RSS sources, ingests articles from those sources on a schedule, stores article metadata, and serves each user a personalized feed of articles drawn from the sources they follow. You do **not** need to draw diagrams, but you must clearly explain the main components, the storage model, and the request flows.

This interview emphasizes two areas above all others: **database table design** and **API design**. Spend the bulk of your time there. Be concrete: name the tables, their columns, primary keys, the indexes that make the hot queries fast, and the endpoints with their request/response shapes.

```hint Where to start
Split the system into two paths and design for each independently: the **ingestion path** (scheduler → fetch RSS → parse → store articles) and the **serving path** (a user requests their feed). Which of the two does your traffic pattern say must be the cheapest? Let that decision drive the schema.
```

```hint Data model
Sketch the entities before the columns. Which tables do you need so that "give me one user's feed" is answerable, and how do you model the relationship between users and the sources they follow? For each of your hot queries (listing a user's subscriptions, finding who follows a source, reading recent articles for a source), ask what single index would let that query avoid a sort or a scan.
```

```hint Dedup and idempotency
RSS fetches get retried, and the same article often reappears across re-fetches (and across syndicating sources). What property must an article insert have so a retry is harmless? What field would you key dedup on first, and what's your fallback when that field is missing or unstable? Then ask which layer (application vs. database) should *enforce* "no duplicates."
```

```hint Feed generation: the key design decision
The central question is *when* a user's feed is assembled: at read time (merge recent articles across their subscribed sources on request) or at write time (materialize per-user rows as each article lands). Weigh read cost vs. write amplification, especially what a single source followed by a huge fraction of users does to each approach. State which you'd pick by default and what would make you switch.
```

```hint Pagination
The naive choice (`OFFSET`) breaks when new articles arrive mid-scroll: rows duplicate or get skipped. What stable key could you page on instead so the boundary stays fixed as data shifts, and how does that key line up with the index backing the feed read?
```
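One way to make the pagination hint concrete is keyset (cursor) pagination on `(published_at, id)`. The sketch below is illustrative only: it uses SQLite for portability rather than PostgreSQL, and the table layout, the index name `idx_articles_source_time`, and the helper names `encode_cursor`, `decode_cursor`, and `feed_page` are assumptions, not part of the question.

```python
import base64
import json
import sqlite3


def encode_cursor(published_at: int, article_id: int) -> str:
    """Pack the last row's sort key into an opaque, URL-safe token."""
    raw = json.dumps([published_at, article_id]).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor: str):
    published_at, article_id = json.loads(base64.urlsafe_b64decode(cursor))
    return published_at, article_id


def feed_page(conn, source_ids, limit, cursor=None):
    """Return (rows, next_cursor), newest first, for the given sources."""
    if not source_ids:
        return [], None
    placeholders = ",".join("?" * len(source_ids))
    sql = (
        "SELECT id, source_id, title, published_at FROM articles "
        f"WHERE source_id IN ({placeholders})"
    )
    params = list(source_ids)
    if cursor is not None:
        ts, last_id = decode_cursor(cursor)
        # Strictly "older than the last row seen"; id breaks timestamp ties.
        sql += " AND (published_at < ? OR (published_at = ? AND id < ?))"
        params += [ts, ts, last_id]
    sql += " ORDER BY published_at DESC, id DESC LIMIT ?"
    params.append(limit)
    rows = conn.execute(sql, params).fetchall()
    # A full page suggests more rows may follow; a short page ends the scroll.
    next_cursor = encode_cursor(rows[-1][3], rows[-1][0]) if len(rows) == limit else None
    return rows, next_cursor


def demo_pagination():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE articles (id INTEGER PRIMARY KEY, source_id INTEGER, "
        "title TEXT, published_at INTEGER)"
    )
    conn.execute(
        "CREATE INDEX idx_articles_source_time "
        "ON articles (source_id, published_at DESC, id DESC)"
    )
    conn.executemany(
        "INSERT INTO articles VALUES (?, ?, ?, ?)",
        [(i, 1, f"a{i}", 100 + i // 2) for i in range(1, 7)],
    )
    page1, cur = feed_page(conn, [1], 2)
    # A newer article arrives between the two page requests.
    conn.execute("INSERT INTO articles VALUES (7, 1, 'a7', 999)")
    page2, _ = feed_page(conn, [1], 2, cur)
    return [r[0] for r in page1], [r[0] for r in page2]
```

In this sketch the cursor carries the same columns, in the same order, as the index's trailing keys, so the second page neither repeats nor skips rows when a newer article lands mid-scroll.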

Apr 29, 2026, 12:00 AM

Constraints & Assumptions

  • Single logical service; you may assume a relational primary store (e.g. PostgreSQL) with the option to offload feed items to a key-value/wide-column store at scale.
  • Eventual consistency is acceptable: a newly published article need not appear in feeds instantly — seconds-to-minutes of lag is fine.
  • Sources number in the hundreds of thousands; a popular source may be followed by a large fraction of users (fanout hot spot).
  • Feed reads dominate traffic and must be low-latency; ingestion is bursty (some sources publish many items at once).
  • RSS metadata is unreliable: `guid` may be missing, `published_at` may be absent or wrong, URLs may change, and the same article may be syndicated by multiple sources.
  • You may assume an external auth layer supplies an authenticated `user_id`.
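Because the constraints above say `guid` can be missing and URLs can change, a dedup key usually needs a fallback chain. The following is a minimal sketch under stated assumptions: the helper names `canonical_url` and `dedup_key`, the choice to strip only `utm_*` query parameters, and the guid → canonical URL → normalized title ordering are all illustrative choices, not requirements from the question.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: only utm_* parameters are treated as tracking noise.
TRACKING_PREFIXES = ("utm_",)


def canonical_url(url: str) -> str:
    """Normalize a link so trivially different URLs compare equal."""
    parts = urlsplit(url.strip())
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.lower().startswith(TRACKING_PREFIXES)
    ]
    path = parts.path.rstrip("/") or "/"
    # Lowercase scheme and host, sort the remaining query, drop the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(sorted(query)), "")
    )


def dedup_key(source_id, guid, link, title):
    """Pick the most stable identifier available and hash it."""
    if guid and guid.strip():
        # A guid is only meaningful within its own feed, so scope it by source.
        basis = f"guid:{source_id}:{guid.strip()}"
    elif link and link.strip():
        # URL-based keys are not source-scoped, so syndicated copies collide.
        basis = f"url:{canonical_url(link)}"
    else:
        # Last resort: whitespace- and case-normalized title within the source.
        basis = f"title:{source_id}:{' '.join((title or '').lower().split())}"
    return hashlib.sha256(basis.encode("utf-8")).hexdigest()
```

A key like this is meant to back a database uniqueness constraint, so that the database rather than application code rejects the duplicate on a retried fetch. Note that in this sketch an article with a `guid` is not matched against a syndicated copy elsewhere; catching that case would need a separate canonical-URL comparison.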

Clarifying Questions to Ask

  • What is the read:write ratio, and roughly how many sources does a typical user subscribe to (tens? thousands)? This drives fanout-on-read vs. fanout-on-write.
  • How fresh must feeds be — is minutes-of-lag acceptable, or is near-real-time required?
  • Do we need full-text article content and search, or only metadata (title, summary, link)?
  • What per-user state must we track (read / saved / hidden), and is read-state required to be strongly consistent?
  • Are there a few "celebrity" sources with millions of subscribers we must plan a fanout hot spot around?
  • Do we control the article publishers, or are these arbitrary third-party RSS feeds (affecting fetch politeness, rate limits, and trust)?
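If the feeds are arbitrary third-party sources, fetch politeness typically comes down to conditional requests plus backoff on failure. The sketch below shows only the bookkeeping and makes no network calls. `SourceState`, `conditional_headers`, `next_fetch_delay`, `record_result`, the user-agent string, and both interval constants are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional

BASE_INTERVAL_S = 900    # assumption: poll a healthy feed every 15 minutes
MAX_INTERVAL_S = 86_400  # assumption: never back off longer than one day


@dataclass
class SourceState:
    etag: Optional[str] = None
    last_modified: Optional[str] = None
    consecutive_failures: int = 0


def conditional_headers(state: SourceState) -> dict:
    """Build request headers that let the server answer 304 Not Modified."""
    headers = {"User-Agent": "example-feed-fetcher/0.1"}
    if state.etag:
        headers["If-None-Match"] = state.etag
    if state.last_modified:
        headers["If-Modified-Since"] = state.last_modified
    return headers


def next_fetch_delay(state: SourceState) -> int:
    """Double the polling interval per consecutive failure, up to a cap."""
    return min(BASE_INTERVAL_S * 2 ** state.consecutive_failures, MAX_INTERVAL_S)


def record_result(state: SourceState, status: int, etag=None, last_modified=None) -> bool:
    """Update per-source state after a fetch; return True if the body needs parsing."""
    if status == 304:
        state.consecutive_failures = 0
        return False
    if 200 <= status < 300:
        state.etag, state.last_modified = etag, last_modified
        state.consecutive_failures = 0
        return True
    state.consecutive_failures += 1
    return False
```

A production scheduler would likely add random jitter to the delay and a per-domain rate limit; both are omitted here to keep the sketch short.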

What a Strong Answer Covers

  • Requirements split into functional (subscribe, ingest, dedup, serve paginated feed, per-user state) and non-functional (low-latency reads, idempotent ingestion, eventual consistency, source-failure isolation).
  • A concrete relational schema: every core table with columns, primary keys, and the specific indexes that back each hot query (subscription lookup, fanout to subscribers, feed read).
  • Idempotent, deduplicated ingestion: conditional fetches (`ETag`/`If-Modified-Since`), GUID/canonical-URL dedup, upsert-on-conflict, and a clear decoupling of fetch → parse → store stages.
  • A clear API surface with REST endpoints, request/response bodies, and cursor-based pagination on the feed endpoint.
  • An explicit feed-generation decision: fanout-on-read vs. fanout-on-write, the trade-offs, and a defended default (typically a hybrid).
  • Scaling and failure handling: per-domain fetch rate limiting, exponential backoff and dead-lettering for broken feeds, partitioning strategy, caching of hot feed pages, and source-level failure isolation.
  • Edge cases: missing GUIDs, missing/incorrect `published_at`, cross-source syndication, post-publication article edits, and unsubscribe-after-materialization cleanup.
  • Observability: the ingestion and serving metrics you'd track (fetch latency, parse-failure rate, duplicate rate, queue lag, feed-read p99).
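As one possible shape for the schema and the upsert-on-conflict points above, here is a small runnable sketch. It uses SQLite so it can run anywhere, whereas the question assumes something like PostgreSQL. Every table, column, and index name, along with the `upsert_article` helper, is an assumption made for illustration; a real answer would likely add per-user state and more source metadata.

```python
import sqlite3

SCHEMA = """
CREATE TABLE sources (
    id            INTEGER PRIMARY KEY,
    feed_url      TEXT NOT NULL UNIQUE,
    etag          TEXT,
    last_modified TEXT,
    next_fetch_at INTEGER
);
-- The primary key serves "list a user's subscriptions".
CREATE TABLE subscriptions (
    user_id    INTEGER NOT NULL,
    source_id  INTEGER NOT NULL REFERENCES sources(id),
    created_at INTEGER NOT NULL,
    PRIMARY KEY (user_id, source_id)
);
-- The reverse index serves "who follows this source" for fanout.
CREATE INDEX idx_subscriptions_source ON subscriptions (source_id, user_id);
CREATE TABLE articles (
    id           INTEGER PRIMARY KEY,
    source_id    INTEGER NOT NULL REFERENCES sources(id),
    dedup_key    TEXT NOT NULL,
    title        TEXT NOT NULL,
    link         TEXT,
    published_at INTEGER NOT NULL,
    UNIQUE (source_id, dedup_key)
);
-- Serves "recent articles for a source" without a sort.
CREATE INDEX idx_articles_source_time
    ON articles (source_id, published_at DESC, id DESC);
"""


def upsert_article(conn, source_id, dedup_key, title, link, published_at):
    """Insert an article; a retry or re-fetch updates the row instead of duplicating it."""
    conn.execute(
        """
        INSERT INTO articles (source_id, dedup_key, title, link, published_at)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT (source_id, dedup_key)
        DO UPDATE SET title = excluded.title, link = excluded.link
        """,
        (source_id, dedup_key, title, link, published_at),
    )


def demo_ingest():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO sources (id, feed_url) VALUES (1, 'https://example.com/rss')")
    upsert_article(conn, 1, "k1", "First title", "https://example.com/a", 100)
    # Same article seen again on a later fetch, with an edited title.
    upsert_article(conn, 1, "k1", "Edited title", "https://example.com/a", 100)
    upsert_article(conn, 1, "k2", "Second", "https://example.com/b", 200)
    return conn.execute("SELECT dedup_key, title FROM articles ORDER BY id").fetchall()
```

The conflict clause deliberately leaves `published_at` at its first-seen value, so an edited article does not jump position in a feed that is paginated on that column. Whether edits should reorder the feed is a product decision the sketch does not settle.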

Follow-up Questions

  • A single source suddenly gains millions of subscribers and publishes a burst of articles. How does your fanout-on-write path avoid melting? (Consider lazy/hybrid fanout, batching, and treating celebrity sources as pull-only.)
  • A user unsubscribes from a source they were already materialized into under fanout-on-write. How do you keep stale articles out of their feed without an expensive synchronous delete?
  • Two different sources syndicate the identical article. How do you detect the duplicate at ingestion time, and how do you decide which copy a user sees?
  • How would you add full-text search over article titles and summaries, and where does that index live relative to the primary store?
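The first two follow-ups can both be approached at read time. The sketch below assumes a hybrid design in which ordinary sources are materialized per user, very popular sources are pulled on request, and unsubscribes are handled by filtering against the user's current subscriptions rather than by deleting rows synchronously. The function name `read_feed` and the `(published_at, article_id, source_id)` tuple layout are hypothetical.

```python
import heapq
from itertools import islice


def read_feed(materialized, pull_by_source, current_subs, limit):
    """Merge precomputed rows with pulled rows, newest first.

    materialized:   this user's fanout-on-write rows, sorted newest first
    pull_by_source: {source_id: rows} for pull-only sources, each sorted newest first
    current_subs:   set of source ids the user follows right now
    Each row is a (published_at, article_id, source_id) tuple.
    """
    pulled = [rows for sid, rows in pull_by_source.items() if sid in current_subs]
    merged = heapq.merge(
        materialized, *pulled, key=lambda row: (row[0], row[1]), reverse=True
    )
    seen = set()

    def fresh():
        for row in merged:
            _, article_id, source_id = row
            # Skip rows left behind by a source the user has since unfollowed.
            if source_id not in current_subs:
                continue
            # Guard against the same article id arriving from both inputs.
            if article_id in seen:
                continue
            seen.add(article_id)
            yield row

    return list(islice(fresh(), limit))
```

With this approach, stale materialized rows can be cleaned up later by a background job, and a burst from a heavily followed source costs nothing at write time because it is never fanned out.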
