Design a trending-articles platform

Q: Design a trending-articles platform

This question evaluates system design competence for building large-scale, near-real-time trending and feed systems, testing skills in streaming ingestion, event modeling, ranking algorithms, caching, consistency and fault-tolerance, capacity planning, and operational observability within the System Design domain.

Q: What difficulty level is this interview question?

This is a hard difficulty System Design question, commonly asked during Onsite rounds at Figma.

Q: What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Figma during technical interviews.

Question

Design the backend that surfaces trending articles in near real time for a large-scale consumer product. The system must serve global, regional (country/region), and category-specific trending feeds, and absorb hundreds of thousands of engagement events per second at peak (views, clicks, dwell time, shares).

"Trending" should reflect recent acceleration in engagement, not just all-time popularity, so an article that suddenly spikes in the last few minutes should be able to rise above an evergreen article with a high lifetime count. Reads must stay fast and stable enough to paginate, while new trends become visible within seconds of the underlying activity.

Your task spans the full pipeline: define requirements and scale, then design (1) event ingestion and storage, (2) the streaming/batch compute that turns raw events into ranked top-K lists, (3) caching, pagination, and the ranking-service read path, (4) the consistency model and fault tolerance, (5) capacity planning and sharding, and (6) the concrete technology choices with trade-offs. Use back-of-the-envelope estimates and small worked examples to justify the design.

Constraints & Assumptions

State your own numbers if the interviewer does not give them; the design must be self-consistent with whatever you choose. A reasonable working set:

Scale: ~100M DAU; a couple of trending-feed reads per user per day; ~30 engagement events per user per day. This implies roughly $10^4$ feed QPS and $10^5$ – $2\times10^5$ events/sec at peak (apply a 5× peak factor to averages).
Freshness: new trends visible within seconds (target median < 5s, P99 < ~30s from event to list visibility).
Read latency: server-side P99 in the low hundreds of milliseconds (e.g., P50 ~50ms, P99 ~200ms).
Availability: read path ≥ 99.9% (multi-AZ, multi-region active-active for reads); ingestion can briefly degrade without dropping events.
Retention: hot top-K lists for minutes, aggregated counters warm for weeks, raw events cold for months (backfill/ML).
Segments: the product slices feeds by global, region:<X>, locale:<X>, category:<X>, and combinations (e.g., locale:en-US|category:design).
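The working numbers above can be sanity-checked with a few lines of arithmetic. This is only a sketch: "a couple of reads" is taken to mean 2 per user per day, and the variable names are illustrative.

```python
# Back-of-the-envelope check of the working set above.
SECONDS_PER_DAY = 86_400

dau = 100_000_000
feed_reads_per_user = 2   # assumption: "a couple" of reads per user per day
events_per_user = 30
peak_factor = 5           # peak = 5x the daily average

avg_feed_qps = dau * feed_reads_per_user / SECONDS_PER_DAY
avg_eps = dau * events_per_user / SECONDS_PER_DAY

peak_feed_qps = avg_feed_qps * peak_factor
peak_eps = avg_eps * peak_factor

print(f"feed QPS: avg ~{avg_feed_qps:,.0f}, peak ~{peak_feed_qps:,.0f}")
print(f"events/s: avg ~{avg_eps:,.0f}, peak ~{peak_eps:,.0f}")
```

Both peaks land inside the ranges quoted in the Scale line, so the rest of the design can be sized against them.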

Clarifying Questions to Ask

Scope the whole problem before designing:

Definition of "trending": absolute decayed engagement, or acceleration relative to a baseline? Which signals count (views vs. clicks vs. dwell vs. shares), and how are they weighted?
Personalization: is the feed purely global/segment-ranked for everyone, or is light per-user re-ranking in scope?
Freshness vs. stability: how fresh must results be, and how much list churn between refreshes is acceptable (does a user scrolling expect a stable ordering)?
Abuse model: what manipulation are we defending against (bot farms, share rings, single-IP floods), and is there an editorial/safety layer that can hard-block articles?
Locale/fallback: what does a sparse segment (e.g., a small locale×category) do when it lacks enough volume — fall back to a broader segment, or show fewer items?
Read shape: page size limits, max depth of pagination, and whether clients need a stable snapshot across pages.

Part 1 — Event Ingestion & Storage

Design how raw engagement events enter the system and where they live. Cover the event schema and enrichment (what fields, how region/locale/bot signals are attached at the edge), the hot/warm/cold storage tiers and what each holds, and the partitioning and indexing strategy for looking up state by article, locale, and category.

What This Part Should Cover

A concrete event schema (Avro/Protobuf) with event_id, event_time vs ingest_time, article_id, locale/region, event_type, dwell_ms, and a trust/bot field, plus a schema_version.
Three tiers with clear roles: hot serving store (top-K lists), warm aggregate/counter store, cold raw-event store; and what retention each gets.
Partitioning/indexing that supports per-(segment, article) updates and top-K lookups without hot partitions, including how article metadata is joined in.
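As one possible shape for the schema, here is a minimal Python sketch of the fields listed above. A real deployment would express this in Avro or Protobuf; the field types, the trust_score name, and the segment_ids helper are assumptions for illustration, and the category is assumed to come from a join against article metadata.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class EventType(str, Enum):
    VIEW = "view"
    CLICK = "click"
    DWELL = "dwell"
    SHARE = "share"


@dataclass(frozen=True)
class EngagementEvent:
    schema_version: int
    event_id: str             # unique per event; used as the dedup key downstream
    event_time_ms: int        # when the user acted (client/edge clock)
    ingest_time_ms: int       # when the edge accepted the event
    article_id: str
    region: str               # attached at the edge, e.g. "US"
    locale: str               # e.g. "en-US"
    event_type: EventType
    dwell_ms: Optional[int]   # only meaningful for dwell events
    trust_score: float        # hypothetical bot/trust signal in [0, 1]


def segment_ids(event: EngagementEvent, category: str) -> List[str]:
    """Fan one event out to every segment it contributes to."""
    return [
        "global",
        f"region:{event.region}",
        f"locale:{event.locale}",
        f"category:{category}",
        f"locale:{event.locale}|category:{category}",
    ]


example = EngagementEvent(
    schema_version=1, event_id="evt-1", event_time_ms=1_000, ingest_time_ms=1_250,
    article_id="a1", region="US", locale="en-US",
    event_type=EventType.CLICK, dwell_ms=None, trust_score=0.9,
)
print(segment_ids(example, category="design"))
```

Keeping event_time and ingest_time as separate fields is what lets the stream job reason about lateness later on.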

Part 2 — Streaming & Batch Compute

Design the compute that turns the raw event stream into ranked top-K lists per segment. Cover sliding/decayed windows (how a recent spike outranks an evergreen article), top-K per segment and how partial results merge, handling late/out-of-order events, deduplication, and watermarks, and the backfill/reprocessing path.

What This Part Should Cover

A decay/acceleration scoring formula with a worked numeric example, and how signals combine into one composite score.
A top-K strategy with a hierarchical merge for hot segments and a bounded-memory heavy-hitter structure.
Concrete late-event / dedup / watermark handling, and a backfill path that recomputes from cold storage and swaps in corrected results atomically.
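One common way to make a recent spike outrank an evergreen article is an exponentially decayed counter with half-life H. The sketch below is an illustration under assumed numbers (H = 10 minutes, an evergreen article at 10 events/minute for 24 hours, a spiking article at 200 events/minute for the last 5 minutes); DecayedCounter is a hypothetical name, not a library class.

```python
class DecayedCounter:
    """Score that halves every `half_life_s` seconds of event time."""

    def __init__(self, half_life_s: float):
        self.half_life_s = half_life_s
        self.score = 0.0
        self.as_of = 0.0  # event time the stored score is valid for

    def _factor(self, dt_s: float) -> float:
        return 0.5 ** (dt_s / self.half_life_s)

    def add(self, event_time_s: float, weight: float = 1.0) -> None:
        if event_time_s >= self.as_of:
            # In-order event: decay the stored score forward, then add.
            self.score *= self._factor(event_time_s - self.as_of)
            self.as_of = event_time_s
            self.score += weight
        else:
            # Late event: decay its weight forward to `as_of` instead.
            self.score += weight * self._factor(self.as_of - event_time_s)

    def value_at(self, now_s: float) -> float:
        return self.score * self._factor(max(0.0, now_s - self.as_of))


HALF_LIFE_S = 600.0  # assumption: H = 10 minutes
evergreen = DecayedCounter(HALF_LIFE_S)
spike = DecayedCounter(HALF_LIFE_S)

for minute in range(1440):        # 24 hours of steady traffic
    evergreen.add(minute * 60.0, weight=10.0)
for minute in range(1435, 1440):  # only the last 5 minutes
    spike.add(minute * 60.0, weight=200.0)

now_s = 1440 * 60.0
evergreen_score = evergreen.value_at(now_s)
spike_score = spike.value_at(now_s)
print(f"evergreen: lifetime=14400 events, decayed={evergreen_score:.1f}")
print(f"spike:     lifetime=1000 events,  decayed={spike_score:.1f}")
```

The evergreen article has over ten times the lifetime count, yet its decayed score settles at a steady-state value well below the spike's. Because decay is multiplicative, a late event can be folded in at its decayed weight without replaying history. A composite score could feed per-signal weights (views, clicks, dwell, shares) into the weight argument, and an acceleration term could compare a short-half-life counter against a long-half-life one; both are left out of this sketch.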

Part 3 — Caching, Pagination & the Ranking-Service Read Path

Design the read path that serves GET /trending. Cover how top lists are cached and refreshed, cursor design for stable pagination, the ranking service that assembles a response (filters, diversity, optional re-rank), and the personalization-vs-global trade-off.

What This Part Should Cover

A cache layout (e.g., Redis sorted sets) with TTLs and a streaming-driven refresh that only republishes when the list materially changes.
A cursor design giving stable pagination across list versions, plus an example request/response (scope, locale, category, page_size, page_token).
A clear stance on global vs. personalized ranking with the latency/cost trade-off, and safety/editorial filtering + result diversity in the response path.
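A cursor that pins the snapshot version is one way to keep pagination stable while new lists are published. The sketch below uses an in-memory dictionary in place of the cache and assumes older versions are retained for a few minutes; SNAPSHOTS, LATEST, publish, and get_trending are hypothetical names.

```python
import base64
import json
from typing import Dict, List, Optional, Tuple

# Stand-in for the hot store: (segment, version) -> ranked article IDs.
SNAPSHOTS: Dict[Tuple[str, int], List[str]] = {
    ("global", 41): ["a1", "a2", "a3", "a4", "a5"],
}
LATEST: Dict[str, int] = {"global": 41}


def publish(segment: str, version: int, items: List[str]) -> None:
    SNAPSHOTS[(segment, version)] = items
    LATEST[segment] = version  # pointer flips only after the list is written


def encode_token(segment: str, version: int, offset: int) -> str:
    raw = json.dumps({"s": segment, "v": version, "o": offset}).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_token(token: str) -> Tuple[str, int, int]:
    d = json.loads(base64.urlsafe_b64decode(token.encode()))
    return d["s"], d["v"], d["o"]


def get_trending(segment: str, page_size: int, page_token: Optional[str] = None) -> dict:
    version, offset = LATEST[segment], 0
    if page_token:
        tok_segment, tok_version, tok_offset = decode_token(page_token)
        if tok_segment != segment:
            raise ValueError("page_token does not match the requested segment")
        # Stay on the pinned version if it still exists; otherwise restart.
        if (segment, tok_version) in SNAPSHOTS:
            version, offset = tok_version, tok_offset
    items = SNAPSHOTS[(segment, version)]
    page = items[offset:offset + page_size]
    next_offset = offset + len(page)
    next_token = encode_token(segment, version, next_offset) if next_offset < len(items) else None
    return {"version": version, "items": page, "next_page_token": next_token}


page1 = get_trending("global", page_size=2)
publish("global", 42, ["a9", "a1", "a2", "a3", "a4"])  # a new list lands mid-scroll
page2 = get_trending("global", page_size=2, page_token=page1["next_page_token"])
print(page1["items"], page2["items"], page2["version"])
```

Without the pinned version, the second page would be read from the new list at offset 2 and repeat an article the user has already seen. The cost is that old versions must outlive the longest expected scroll session.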

Part 4 — Consistency Model & Fault Tolerance

State the delivery and consistency guarantees end to end (event ingestion → aggregation → snapshot → read), how idempotency is achieved, and the failure modes with graceful degradation.

What This Part Should Cover

The chosen guarantee per stage (at-least-once + dedup vs. exactly-once sink) and how idempotent upserts are keyed.
Concrete failure modes (stream lag, store throttle, region outage) and the degraded behavior for each (serve stale/last-good list, widen the window, fall back to a broader segment).
Multi-AZ / multi-region posture and how checkpointing + versioned snapshots enable safe recovery and rollback.
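With at-least-once delivery, the aggregation step has to tolerate redelivered events. A minimal sketch of dedup keyed by event_id is shown below; it assumes the dedup state lives in the stream job's keyed state with a retention tied to the watermark, and IdempotentAggregator is a hypothetical name.

```python
from typing import Dict, Tuple


class IdempotentAggregator:
    """Applies each event_id at most once to a (segment, article) counter."""

    def __init__(self, dedup_ttl_s: float):
        self.dedup_ttl_s = dedup_ttl_s
        self.seen: Dict[str, float] = {}               # event_id -> event time
        self.counts: Dict[Tuple[str, str], float] = {}

    def apply(self, event_id: str, segment_id: str, article_id: str,
              event_time_s: float, weight: float = 1.0) -> bool:
        if event_id in self.seen:
            return False  # redelivery after a retry or replay: ignore
        self.seen[event_id] = event_time_s
        key = (segment_id, article_id)
        self.counts[key] = self.counts.get(key, 0.0) + weight
        return True

    def expire(self, watermark_s: float) -> None:
        # Drop dedup entries older than the retention window so state stays bounded.
        cutoff = watermark_s - self.dedup_ttl_s
        self.seen = {e: t for e, t in self.seen.items() if t >= cutoff}


agg = IdempotentAggregator(dedup_ttl_s=3600.0)
first = agg.apply("evt-1", "global", "a1", event_time_s=100.0)
retry = agg.apply("evt-1", "global", "a1", event_time_s=100.0)  # duplicate delivery
print(first, retry, agg.counts[("global", "a1")])
```

The retention window bounds how late a duplicate can arrive and still be caught; duplicates older than that would need the backfill path to correct.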

Part 5 — Capacity Planning & Sharding

Produce rough capacity estimates for the main components and a sharding strategy that avoids hot spots.

What This Part Should Cover

Bus/partition sizing from peak EPS and payload size (with replication factor), and stream-processing sizing (per-core throughput → core/memory count with HA headroom).
Hot-store and warm-store sizing derived from segment count, K, and active article×segment pairs.
A sharding scheme (partition by hash(segment, article); two-level aggregation for hot segments; even hash-slot distribution / key tagging in the cache) and how a single hot key is split and re-merged.
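The split-and-re-merge idea for a hot key can be sketched as key salting followed by a second aggregation stage. The example below is an illustration only: the salt count, the use of CRC32 as a stable hash, and the function names are all assumptions.

```python
import heapq
import zlib
from collections import Counter
from typing import Iterable, List, Set, Tuple

Event = Tuple[str, str, str]  # (event_id, segment_id, article_id)


def salt_for(event_id: str, salts: int) -> int:
    # CRC32 is stable across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(event_id.encode()) % salts


def stage1_partial_counts(events: Iterable[Event], hot_segments: Set[str], salts: int) -> Counter:
    """First level: hot segments are spread over `salts` sub-keys."""
    partials: Counter = Counter()
    for event_id, segment_id, article_id in events:
        salt = salt_for(event_id, salts) if segment_id in hot_segments else 0
        partials[(segment_id, article_id, salt)] += 1
    return partials


def stage2_top_k(partials: Counter, segment_id: str, k: int) -> List[Tuple[str, int]]:
    """Second level: sum the salted partials per article, then take the top K."""
    totals: Counter = Counter()
    for (seg, article_id, _salt), n in partials.items():
        if seg == segment_id:
            totals[article_id] += n
    return heapq.nlargest(k, totals.items(), key=lambda kv: (kv[1], kv[0]))


events = (
    [(f"a1-{i}", "global", "a1") for i in range(100)]
    + [(f"a2-{i}", "global", "a2") for i in range(50)]
    + [(f"a3-{i}", "global", "a3") for i in range(10)]
)
partials = stage1_partial_counts(events, hot_segments={"global"}, salts=4)
top2 = stage2_top_k(partials, "global", k=2)
print(top2)
```

Salting by event_id spreads one hot (segment, article) key across several partitions at the cost of an extra merge hop, so it is usually reserved for segments known to be hot.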

Part 6 — Technology Choices & Trade-offs

Justify the concrete technologies for each layer and name the trade-off behind each pick.

What This Part Should Cover

A defended pick per layer (message bus; stream engine; hot store; warm store; cold store/batch; metadata/search) with the rejected alternative.
Why the stream engine choice fits event-time processing and large state (e.g., Flink vs. Spark Structured Streaming vs. Kafka Streams).
Why the hot store fits atomic top-K + versioning (e.g., Redis ZSETs vs. Aerospike/Memcached), and managed-vs-self-managed ops trade-offs for the warm store.

What a Strong Answer Covers

Across all parts, a strong answer is internally consistent and reasons from the numbers rather than listing components:

One coherent dataflow. Ingestion → bus → stream compute → warm aggregates + hot snapshots → ranking read path, with the same segment_id / article_id keying threaded through every layer.
Freshness vs. stability is treated as the central tension — decay/acceleration for freshness, versioned snapshots and minimum-support thresholds for stability — and the candidate states where they land.
Numbers drive the design: estimates in Parts 2/5 are consistent with the scale assumed up front, and capacity is sized with explicit headroom.
Abuse and editorial control are designed in, not bolted on: trust-weighted events, per-IP/source caps, hard editorial blocks, and retroactive downweighting via reprocessing.
Observability and guardrails: ingestion lag, window completeness, cache hit ratio, latency percentiles, version-churn, and feature flags to tune half-life / K / acceleration weight.
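Version-churn can be given a concrete definition so that it can be alerted on and compared across tuning changes. One possible definition, offered as an assumption rather than a standard, is one minus the Jaccard overlap of consecutive top-K lists; the should_republish helper and its thresholds are likewise hypothetical.

```python
from typing import List


def churn(prev: List[str], curr: List[str]) -> float:
    """1 - Jaccard overlap between two top-K lists (0 = same set, 1 = disjoint)."""
    a, b = set(prev), set(curr)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def should_republish(prev: List[str], curr: List[str],
                     min_churn: float = 0.1, top_n: int = 3) -> bool:
    """Republish if the head of the list changed or enough membership turned over."""
    return prev[:top_n] != curr[:top_n] or churn(prev, curr) >= min_churn


previous = ["a1", "a2", "a3", "a4", "a5"]
tail_swap = ["a1", "a2", "a3", "a5", "a4"]   # same members, tail reordered
head_swap = ["a2", "a1", "a3", "a4", "a5"]   # top of the list changed

print(churn(previous, tail_swap), should_republish(previous, tail_swap))
print(churn(previous, head_swap), should_republish(previous, head_swap))
```

Tracking this value per segment over time gives a rough way to tell whether a shorter half-life made the feed fresher or simply noisier.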

Follow-up Questions

A coordinated bot ring inflates one article in a single region for ~5 minutes. Walk through every layer that should catch or dampen it, and what gets corrected retroactively.
The streaming job falls 10 minutes behind during a traffic spike. What does the read API serve, what alerts fire, and how do you recover without publishing a wrong list?
Product now wants per-user personalized trending at full scale. What changes, what breaks, and where does the global/segment design stop being sufficient?
How would you tune the decay half-life $H$ and the acceleration weight, and how would you measure whether a change made the feed "better" rather than just more volatile?
