Design a real-time favorites service at scale

This is a System Design interview question from Roblox for Software Engineer roles.

Question

Design: Real-Time Favorites/Unfavorites Service at High Scale

Context

Design a backend service that lets users favorite/unfavorite items (e.g., posts, products) and exposes each item's favorite count in near real time. The system must support very high traffic with low latency and high reliability. Assume clients are web/mobile, traffic is global, and item popularity can be extremely skewed.

Functional Requirements

Users can favorite or unfavorite an item; each user-item pair holds at most one favorite.
Show favorite counts on item pages in near real time.
Provide bulk count lookups for feeds/lists.
Expose idempotent APIs that tolerate retries and network failures.
Support item deletion and recount/backfill.
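
The per-pair and idempotency requirements above can be illustrated with a minimal in-memory sketch. It is not a proposed architecture: `FavoriteStore` and its method names are hypothetical, and a set and a dict stand in for whatever durable store a real design would use. The point is only that each call reports whether it changed state, so a retried request never moves the count twice.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FavoriteStore:
    """Hypothetical in-memory stand-in for a store keyed by (user_id, item_id)."""

    pairs: set[tuple[str, str]] = field(default_factory=set)
    counts: dict[str, int] = field(default_factory=dict)

    def favorite(self, user_id: str, item_id: str) -> bool:
        """Record a favorite. Returns True only if state changed."""
        key = (user_id, item_id)
        if key in self.pairs:
            return False  # duplicate or retried request: no-op
        self.pairs.add(key)
        self.counts[item_id] = self.counts.get(item_id, 0) + 1
        return True

    def unfavorite(self, user_id: str, item_id: str) -> bool:
        """Remove a favorite. Returns True only if state changed."""
        key = (user_id, item_id)
        if key not in self.pairs:
            return False  # nothing to undo: no-op
        self.pairs.remove(key)
        self.counts[item_id] -= 1
        return True

    def bulk_counts(self, item_ids: list[str]) -> dict[str, int]:
        """Look up counts for many items at once; unknown items report 0."""
        return {item_id: self.counts.get(item_id, 0) for item_id in item_ids}


store = FavoriteStore()
store.favorite("u1", "i1")
store.favorite("u1", "i1")  # retry of the same request, count stays at 1
print(store.bulk_counts(["i1", "i2"]))
```

In a distributed design the same contract would typically rest on a uniqueness constraint or conditional write on the (user, item) key, with the returned "changed" flag deciding whether a count delta is emitted.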

Non-Functional Requirements

Scale: ~1,000,000 QPS reads (count lookups), ~100,000 QPS writes (favorite/unfavorite).
Latency targets: reads p50 ≤ 10 ms, p95 ≤ 20 ms, p99 ≤ 50 ms (measured from the edge); writes p95 ≤ 50 ms to acknowledge, with the change reflected in counts globally within 1–2 s.
Availability: ≥ 99.99% for reads; ≥ 99.9% for writes.
Consistency: event-driven eventual consistency for counts (converging within 1–2 s). Strong per-user semantics: a user cannot favorite the same item twice, and unfavorite is a no-op if the item is not favorited.
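
To put the targets above in perspective, the following back-of-the-envelope sketch converts them into daily volumes and yearly downtime budgets. It assumes the stated QPS figures are sustained averages and a 365-day year; both are simplifying assumptions, and the function names are illustrative.

```python
SECONDS_PER_DAY = 86_400
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, assuming a 365-day year

READ_QPS = 1_000_000   # count lookups, from the stated requirements
WRITE_QPS = 100_000    # favorite/unfavorite, from the stated requirements


def daily_requests(qps: int) -> int:
    """Requests per day if the given QPS were sustained around the clock."""
    return qps * SECONDS_PER_DAY


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Yearly downtime budget implied by an availability target."""
    return (100 - availability_percent) / 100 * MINUTES_PER_YEAR


print(f"reads/day:  {daily_requests(READ_QPS):,}")
print(f"writes/day: {daily_requests(WRITE_QPS):,}")
print(f"read budget (99.99%): {downtime_minutes_per_year(99.99):.2f} min/year")
print(f"write budget (99.9%): {downtime_minutes_per_year(99.9):.1f} min/year")
```

Under these assumptions the service handles about 86.4 billion reads and 8.64 billion writes per day, with roughly 52.56 minutes of read downtime and 525.6 minutes (about 8.76 hours) of write downtime allowed per year.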

Specify

APIs and request/response contracts.
Data model and indexing.
Consistency model and latency SLOs.
Idempotency/dedup and exactly-once semantics.
Counter design (sharded/aggregated) and hot-key mitigation.
Caching and invalidation.
Storage choices and partitioning.
Streaming/batch aggregation for near-real-time counts.
Multi-region deployment, replication, and failover.
Handling unfavorite, deletes, recount/backfill.
Rate limiting and abuse prevention.
Security and authorization.
Observability (metrics, alerts), capacity planning, and cost trade-offs.
Test strategy and load testing plan.
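
As one illustration of the "sharded counter and hot-key mitigation" item in the list above, here is a minimal sketch of a sharded counter. It is one common approach, not necessarily the expected answer: each item's count is split across several partial counters so that writes to a very popular item do not all land on one key, and reads sum the partials. `ShardedCounter`, `num_shards`, and the random shard choice are assumptions made for the example; a dict stands in for a distributed key-value store.

```python
from __future__ import annotations

import random
from collections import defaultdict


class ShardedCounter:
    """Hypothetical sharded counter: writes spread out, reads aggregate."""

    def __init__(self, num_shards: int = 16, rng: random.Random | None = None):
        self.num_shards = num_shards
        self._rng = rng or random.Random()
        # (item_id, shard_index) -> partial count; stands in for separate keys
        self._partials: dict[tuple[str, int], int] = defaultdict(int)

    def add(self, item_id: str, delta: int) -> None:
        """Apply +1 (favorite) or -1 (unfavorite) to one randomly chosen shard."""
        shard = self._rng.randrange(self.num_shards)
        self._partials[(item_id, shard)] += delta

    def total(self, item_id: str) -> int:
        """Sum all partial counts for an item; unknown items report 0."""
        return sum(
            self._partials.get((item_id, shard), 0)
            for shard in range(self.num_shards)
        )


counter = ShardedCounter(num_shards=8)
for _ in range(1000):
    counter.add("hot-item", +1)
for _ in range(100):
    counter.add("hot-item", -1)
print(counter.total("hot-item"))
```

The trade-off is that reads must touch every shard, which is why designs of this kind usually pair sharded writes with a cached or periodically aggregated total for the read path.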