System Design: Near Real-Time Trending Articles
Context
Design a backend that surfaces trending articles in near real time for a large-scale consumer product. The system should support global, regional, and category-specific feeds, and scale to hundreds of thousands of events per second at peak. You will specify requirements, constraints, and propose an end-to-end architecture.
Functional Requirements to Clarify
- Feed surfaces and APIs
  - Global, local (country/region), and category feeds
  - Optional personalization vs. purely global ranking
  - Pagination (cursor-based), page size limits, and stable ordering
  - Language and locale support (e.g., en-US) and fallback behavior
- Freshness and latency
  - Feed freshness target (e.g., new trends visible within X seconds)
  - Read latency SLAs (P50/P95/P99)
- Abuse and spam resilience
  - Bot and fraud mitigation, deduplication, rate limits, downweighting
  - Editorial blocks and safety filters
- Observability and controls
  - Feature flags, explainability of ranking, metrics and alerting
Constraints to Define (provide your assumptions if not given)
- Scale assumptions
  - DAU, sessions per user per day
  - Feed QPS (avg and peak), event rates (views, clicks, dwell, shares)
  - Peak write rates to caches and data stores
- SLOs and availability
  - P99 read latency target
  - Freshness and end-to-end pipeline delay (P99)
  - Availability target (e.g., 99.9% or higher)
- Data retention
  - Hot, warm, and cold retention periods
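A quick back-of-envelope pass over the scale assumptions can be scripted. Every number below (DAU, sessions, events per session, peak factor, event size) is an illustrative assumption to be replaced with your own; the point is that plausible consumer-scale inputs land in the "hundreds of thousands of events per second" range the context describes:

```python
# Assumed inputs -- state your own in the interview.
DAU = 200_000_000            # daily active users
SESSIONS_PER_USER = 3        # sessions per user per day
EVENTS_PER_SESSION = 20      # views, clicks, dwell pings, shares
PEAK_FACTOR = 3              # peak-to-average ratio
EVENT_BYTES = 200            # serialized event size after enrichment
SECONDS_PER_DAY = 86_400

daily_events = DAU * SESSIONS_PER_USER * EVENTS_PER_SESSION
avg_eps = daily_events / SECONDS_PER_DAY          # average events/sec
peak_eps = avg_eps * PEAK_FACTOR                  # peak events/sec
daily_ingest_tb = daily_events * EVENT_BYTES / 1e12  # raw TB/day before compression

print(f"avg {avg_eps:,.0f} eps, peak {peak_eps:,.0f} eps, {daily_ingest_tb:.1f} TB/day")
```

With these assumptions the system sees roughly 140K events/sec on average and ~420K at peak, ingesting a few TB of raw events per day, which drives the partition counts and hot-tier sizing asked for below.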
What to Design and Deliver
- Ingestion and storage architecture for user and content events (views, clicks, dwell time, shares)
  - Event schema design and enrichment
  - Hot/warm/cold storage tiers and partitioning
  - Indexing strategy for lookups by article, locale, category
- Streaming and batch compute
  - Sliding windows and time-decayed counts
  - Top-K per segment (global, locale, category) and merging strategies
  - Handling late/out-of-order events, deduplication, and watermarks
  - Backfill/reprocessing strategy
- Caching, invalidation, and ranking service integration
  - How to cache and update top lists; cursor design for pagination
  - Personalization vs. global ranking trade-offs
- Consistency model and fault tolerance
  - Exactly-once or at-least-once guarantees; idempotency
  - Failure modes and graceful degradation
- Capacity planning and sharding
  - Kafka/stream partitions, cache shards, storage sizing
  - Rough capacity estimates and headroom
- Technology choices and trade-offs
  - Justify choices (e.g., Kafka vs. Kinesis, Flink vs. Spark, Redis vs. Aerospike, DynamoDB vs. Cassandra, ClickHouse, OpenSearch, S3/Parquet, etc.)
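As one concrete sketch for the Top-K-per-segment item, the snippet below computes Top-K per shard and merges the partial lists by summing scores. Note the hedge: this merge is only exact if all events for a given article route to a single shard (e.g., the stream is partitioned by article_id); with arbitrary routing, per-shard Top-K can drop articles that are globally large but locally small:

```python
import heapq

def top_k(counts: dict[str, float], k: int) -> list[tuple[str, float]]:
    """Top-K (article, score) pairs from one shard's counters, highest score first."""
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

def merge_top_k(partials: list[list[tuple[str, float]]], k: int) -> list[tuple[str, float]]:
    """Merge per-shard Top-K lists by summing scores, then re-rank.
    Exact only when each article's events live on exactly one shard."""
    merged: dict[str, float] = {}
    for part in partials:
        for article, score in part:
            merged[article] = merged.get(article, 0.0) + score
    return top_k(merged, k)
```

In practice each segment (global, per-locale, per-category) would keep its own counters, and the merged Top-K would be written to the cache that backs the feed API; approximate sketches (e.g., Count-Min plus a heap) are a common alternative when per-article counters are too large to hold exactly.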
Describe diagrams verbally (component by component), and provide API examples, formulas (e.g., exponential decay), small numeric examples, and clear justifications.
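For the exponential-decay formula, a small worked example helps: a score decays as score · 0.5^(Δt / half_life), so a score of 100 falls to 50 after one half-life. The 30-minute half-life below is an assumed tuning parameter, not a given:

```python
def decayed_score(score: float, elapsed_s: float, half_life_s: float = 1800.0) -> float:
    """Exponential decay: halves the score every half_life_s seconds (assumed 30 min)."""
    return score * 0.5 ** (elapsed_s / half_life_s)

def bump(score: float, last_ts: float, now_ts: float,
         increment: float = 1.0, half_life_s: float = 1800.0) -> float:
    """Decay the stored score to 'now', then add the new event's weight.
    Storing (score, last_ts) per article makes this O(1) per event."""
    return decayed_score(score, now_ts - last_ts, half_life_s) + increment

# Worked numbers: 100 -> 50 after 30 min, -> 25 after 60 min.
```

Lazy decay-on-update (as in `bump`) avoids a periodic sweep over every counter: the score is only brought current when an event arrives or the article is read for ranking.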