System Design: News Aggregation Platform (Google News–like)
Context
Design a multi-region news aggregation platform that ingests content from many publishers and serves near-real-time feeds to users on web and mobile. The system must support both pull-based crawling (RSS/Atom feeds and Sitemaps) and push-based publisher updates, and must provide deduplicated, categorized, and personalized news feeds with strong operational guardrails.
Requirements
- Ingestion Layer
  - Publisher onboarding and authentication (domain verification, keys/tokens)
  - Source discovery and scheduling (RSS/Atom, Sitemaps, crawl cadence)
  - Politeness and rate limiting (robots.txt, per-host concurrency, backoff)
  - Fetcher architecture (distributed, retries, content negotiation)
  - Schema normalization and enrichment (canonicalization, text extraction, language detection, category tagging)
  - Deduplication and near-duplicate clustering (story grouping)
  - Near-real-time updates and historical backfill
  - Idempotency and exactly-once semantics across retries and replays
  - Retry strategies and error handling (transient vs permanent)
  - Spam/abuse filtering
  - Copyright and robots compliance
  - Monitoring, alerting, and auditing
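As one illustration of the canonicalization, deduplication, and idempotency items above, the following is a minimal sketch rather than a prescribed design. It assumes exact-duplicate detection can key on publisher, canonical URL, and whitespace-normalized body text. The names `TRACKING_PARAMS`, `canonicalize_url`, `idempotency_key`, and `IngestLog` are hypothetical. The in-memory set stands in for a durable store, and near-duplicate clustering would need a separate similarity technique that is not shown here.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: a small, illustrative set of tracking parameters to drop.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign",
                   "utm_term", "utm_content", "fbclid", "gclid"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize_url(url: str) -> str:
    """Lowercase scheme/host, drop default port, fragment, trailing slash,
    and tracking parameters; sort the remaining query parameters."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path.rstrip("/") or "/"
    kept = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_PARAMS
    )
    return urlunsplit((scheme, netloc, path, urlencode(kept), ""))


def idempotency_key(publisher_id: str, url: str, body: str) -> str:
    """Derive a stable key so a retried or replayed fetch maps to the same record."""
    normalized_body = " ".join(body.split())  # collapse whitespace only
    material = "\n".join([publisher_id, canonicalize_url(url), normalized_body])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


class IngestLog:
    """In-memory stand-in for a durable 'seen keys' store (assumption)."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def accept(self, key: str) -> bool:
        """Return True the first time a key is seen, False on any repeat."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


if __name__ == "__main__":
    log = IngestLog()
    key = idempotency_key("pub-1", "https://Example.com/a/?utm_source=x", "Some  text")
    print(log.accept(key))  # first delivery is accepted
    print(log.accept(key))  # replay of the same article is ignored
```

In this sketch, "exactly-once" is approximated as at-least-once delivery plus idempotent writes keyed on the derived hash. An edited article produces a new key, which is one way to feed the versioning requirement under Storage and Retrieval.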
- Storage and Retrieval
  - Article store (raw and normalized), versioning
  - Indexing for search and feed retrieval
  - Feed generation (global, topic, locale, personalized)
  - Personalization and ranking
  - Freshness and caching
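To make the interaction between ranking and freshness concrete, here is a hedged sketch of one possible scoring rule: a relevance score multiplied by an exponential time decay. The decay form, the six-hour half-life, and the names `Candidate`, `freshness_score`, and `rank_feed` are assumptions for illustration only; the requirements above do not specify a ranking function.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Candidate:
    cluster_id: str
    relevance: float     # assumed to be in [0, 1], e.g. from a personalization model
    published_at: float  # Unix timestamp in seconds


def freshness_score(c: Candidate, now: float, half_life_s: float = 6 * 3600) -> float:
    """Relevance halves for every `half_life_s` seconds of age (assumed decay)."""
    age_s = max(0.0, now - c.published_at)
    return c.relevance * 0.5 ** (age_s / half_life_s)


def rank_feed(candidates: Iterable[Candidate], now: float, limit: int = 20) -> List[Candidate]:
    """Return the top `limit` candidates by decayed score, highest first."""
    ordered = sorted(candidates, key=lambda c: freshness_score(c, now), reverse=True)
    return ordered[:limit]


if __name__ == "__main__":
    now = 1_000_000.0
    older = Candidate("story-a", relevance=0.9, published_at=now - 12 * 3600)
    newer = Candidate("story-b", relevance=0.5, published_at=now)
    for c in rank_feed([older, newer], now):
        print(c.cluster_id, round(freshness_score(c, now), 3))
```

Because the score depends on the current time, a cached feed page goes stale even if no new articles arrive, so the cache TTL and the half-life would need to be chosen together.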
- Cross-Cutting Specifications
  - APIs (publisher/admin, ingestion webhooks, consumer feeds/search)
  - Data models (Publisher, SourceFeed, Article, Cluster, Events, etc.)
  - Consistency guarantees
  - Multi-region scalability and partitioning
  - SLAs/SLOs
  - Capacity estimates and scaling plan
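As a starting point for the data-model item above, the entities it names could be sketched as plain Python dataclasses. Every field name, type, and default below is an assumption made for illustration. Events are omitted, and a real schema would follow from the consistency and partitioning decisions listed alongside it.

```python
from dataclasses import dataclass, field, replace
from typing import List, Optional


@dataclass(frozen=True)
class Publisher:
    publisher_id: str
    domain: str
    verified: bool = False  # set once domain verification succeeds


@dataclass(frozen=True)
class SourceFeed:
    feed_id: str
    publisher_id: str
    url: str
    kind: str                    # assumed values: "rss", "atom", "sitemap"
    crawl_interval_s: int = 900  # assumed default crawl cadence


@dataclass(frozen=True)
class Article:
    article_id: str
    publisher_id: str
    canonical_url: str
    title: str
    language: str
    version: int = 1                  # incremented when the publisher edits the article
    cluster_id: Optional[str] = None  # set when the article joins a story cluster


@dataclass
class Cluster:
    cluster_id: str
    article_ids: List[str] = field(default_factory=list)

    def add(self, article: Article) -> Article:
        """Attach an article to this cluster; adding the same article twice is a no-op."""
        if article.article_id not in self.article_ids:
            self.article_ids.append(article.article_id)
        return replace(article, cluster_id=self.cluster_id)


if __name__ == "__main__":
    pub = Publisher("pub-1", "example.com", verified=True)
    art = Article("art-1", pub.publisher_id, "https://example.com/a", "Title", "en")
    cluster = Cluster("cl-1")
    art = cluster.add(art)
    print(art.cluster_id, cluster.article_ids)
```

Keeping `Article` immutable and returning a new instance from `Cluster.add` is one way to make version history explicit: each change yields a new record instead of mutating a stored one.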