Design a Near Real-Time Anti-Join: Stream vs. Large Reference List
You are given:
- list1: an unbounded, high-throughput event stream of keys (e.g., user IDs, IPs, URLs).
- list2: a very large, frequently updated reference set of keys that may not fit in memory on a single machine.
Goal: Emit, with near real-time latency, the events from list1 whose keys are NOT present in list2.
Provide a detailed system design that includes:
1) Architecture
- Ingestion: how events and list2 changes enter the system.
- Processing: how the anti-join is performed at scale (a minimal processing-loop sketch follows this list).
- Storage: which stores act as the source of truth and which serve as caches.
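For illustration, here is a minimal Python sketch of the kind of processing loop expected, assuming events arrive as plain string keys; `contains_many` stands in for the batched membership check of section 2 and `canonicalize` for the normalization rules of section 4 (both are placeholders, not prescribed implementations):

```python
from typing import Callable, Iterable, Iterator, Sequence

def anti_join(events: Iterable[str],
              contains_many: Callable[[Sequence[str]], Sequence[bool]],
              canonicalize: Callable[[str], str],
              batch_size: int = 500) -> Iterator[str]:
    """Yield events whose canonical key is NOT present in list2."""
    batch: list[str] = []
    for event in events:
        batch.append(event)
        if len(batch) >= batch_size:
            yield from _flush(batch, contains_many, canonicalize)
            batch = []
    if batch:
        yield from _flush(batch, contains_many, canonicalize)

def _flush(batch, contains_many, canonicalize):
    keys = [canonicalize(e) for e in batch]          # shared normalization (section 4)
    for event, present in zip(batch, contains_many(keys)):
        if not present:                              # anti-join: emit only keys absent from list2
            yield event
```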
2) Membership Checking Strategies
Describe how you would perform membership checks for list2 using:
- (a) An in-memory set (when feasible, and how to shard it).
- (b) A distributed cache (e.g., Redis) with pipelining/batching.
- (c) An external database/NoSQL store as the authoritative source.
Explain how you would combine these layers (e.g., L1/L2/L3 lookups) to balance latency and cost; a layered-lookup sketch follows.
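For example, the layered check might be sketched like this (assuming redis-py, a Redis SET named "list2:keys", and a hypothetical `db_lookup_many` batched query against the authoritative store):

```python
from typing import Sequence
import redis

class LayeredMembership:
    """L1: in-process set of hot keys; L2: Redis set (pipelined); L3: authoritative DB."""

    def __init__(self, redis_client: redis.Redis, hot_keys: set[str], db_lookup_many):
        self.l1 = hot_keys                      # small, sharded per worker
        self.redis = redis_client
        self.db_lookup_many = db_lookup_many    # hypothetical batched DB query

    def contains_many(self, keys: Sequence[str]) -> list[bool]:
        result = [False] * len(keys)
        misses = []
        for i, k in enumerate(keys):
            if k in self.l1:                    # L1: no network hop
                result[i] = True
            else:
                misses.append(i)
        if misses:
            pipe = self.redis.pipeline(transaction=False)
            for i in misses:
                pipe.sismember("list2:keys", keys[i])   # L2: one round trip per batch
            l2_hits = pipe.execute()
            still_missing = []
            for i, hit in zip(misses, l2_hits):
                if hit:
                    result[i] = True
                else:
                    still_missing.append(i)
            if still_missing:                   # L3: only keys the cache cannot vouch for
                db_hits = self.db_lookup_many([keys[i] for i in still_missing])
                for i, hit in zip(still_missing, db_hits):
                    result[i] = bool(hit)
        return result
```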
3) Bloom Filter Usage
- When and why to add a Bloom filter in front of the cache/DB.
- How to size it (false-positive rate, memory, number of hash functions) and where to place it (per-partition or global).
- How to mitigate false positives (exact fallback lookups, counting filters vs. periodic rebuilds) and how to handle deletes. A worked sizing example follows this list.
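One standard sizing rule: for n keys and a target false-positive rate p, a Bloom filter needs m = -n * ln(p) / (ln 2)^2 bits and k = (m / n) * ln 2 hash functions. A small calculator with a numeric example (1 billion keys at a 1% false-positive target comes to roughly 1.1 GiB and 7 hash functions):

```python
import math

def bloom_sizing(n_keys: int, fp_rate: float) -> tuple[int, int]:
    """Return (bits, hash_count) for a standard Bloom filter.

    m = -n * ln(p) / (ln 2)^2,  k = (m / n) * ln 2
    """
    m = math.ceil(-n_keys * math.log(fp_rate) / (math.log(2) ** 2))
    k = max(1, round((m / n_keys) * math.log(2)))
    return m, k

# Example: 1e9 keys at a 1% false-positive target
bits, hashes = bloom_sizing(1_000_000_000, 0.01)
print(bits / 8 / 1024**3, hashes)   # ~1.12 GiB of bits, k = 7
```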
4) Normalization Consistency
- The canonicalization rules (e.g., case folding, trimming, Unicode normalization, formatting, hashing).
- How to enforce consistent normalization across producers, consumers, and stores (a shared canonicalization routine is sketched below).
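For example, a single shared, versioned canonicalization routine reused by every producer and consumer; the specific rules below (NFKC, casefold, trim) are illustrative assumptions:

```python
import hashlib
import unicodedata

def canonicalize(raw: str) -> str:
    """Apply the same rules when list2 is built and when list1 is checked."""
    key = unicodedata.normalize("NFKC", raw)   # Unicode normalization
    key = key.strip().casefold()               # trim + aggressive case folding
    return key

def key_digest(raw: str) -> str:
    """Optional: store/compare fixed-width digests instead of raw keys."""
    return hashlib.sha256(canonicalize(raw).encode("utf-8")).hexdigest()
```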
5) Updates to list2
- How list2 is maintained (snapshot + CDC, periodic rebuilds, or streaming updates).
- Ordering/consistency between list1 events and list2 updates to avoid correctness gaps.
- Handling adds and deletes, cache invalidation, and Bloom filter maintenance (a CDC-apply sketch follows).
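As an illustration, applying one CDC record to the cache layer might look like the sketch below (same Redis SET as above; `counting_bloom` is a hypothetical counting Bloom filter with add()/remove(), since a plain Bloom filter cannot delete and would otherwise need periodic rebuilds):

```python
import redis

def apply_list2_change(r: redis.Redis, op: str, key: str, counting_bloom=None) -> None:
    """Apply one CDC record ({"op": "add"|"delete", "key": ...}) to the cache layer."""
    if op == "add":
        r.sadd("list2:keys", key)
        if counting_bloom is not None:
            counting_bloom.add(key)
    elif op == "delete":
        r.srem("list2:keys", key)
        if counting_bloom is not None:
            counting_bloom.remove(key)
    else:
        raise ValueError(f"unknown CDC op: {op}")
```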
6) Back-Pressure and Flow Control
- How the system reacts when downstream is slow (e.g., bounded queues, asynchronous I/O, scaling out); see the bounded-queue sketch below.
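A minimal back-pressure sketch, assuming a single in-process lookup worker fed through a bounded queue; when the queue is full, the caller is expected to pause or rewind the upstream consumer rather than buffer without limit:

```python
import queue

events_q: "queue.Queue[str]" = queue.Queue(maxsize=10_000)   # bounded buffer

def ingest(event: str) -> bool:
    try:
        events_q.put(event, timeout=0.5)   # blocks briefly while the queue is full
        return True
    except queue.Full:
        return False                       # signal: pause/rewind the upstream consumer

def lookup_worker(contains_many, emit, batch_size: int = 500) -> None:
    while True:
        batch = [events_q.get()]           # block until at least one event is available
        while len(batch) < batch_size and not events_q.empty():
            batch.append(events_q.get_nowait())
        for event, present in zip(batch, contains_many(batch)):
            if not present:                # anti-join: emit only non-members
                emit(event)
```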
7) Fault Tolerance and Correctness
- Delivery semantics (at-least-once, at-most-once, or exactly-once) and the idempotency strategy (an idempotent-sink sketch follows).
- Recovery from processor, cache, and database failures.
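For instance, with at-least-once delivery a replay-safe sink can suppress duplicates by a stable event id; this is only a sketch, and a real system would bound the seen-set with a TTL or rely on an idempotent upsert in the sink itself:

```python
from typing import Callable

class IdempotentSink:
    """Drop duplicate emits caused by replays after a failure."""

    def __init__(self, emit_fn: Callable[[dict], None]):
        self.emit_fn = emit_fn
        self.seen: set[str] = set()     # in practice: TTL cache or sink-side dedup

    def emit(self, event_id: str, payload: dict) -> None:
        if event_id in self.seen:
            return                      # duplicate from a replay; drop it
        self.emit_fn(payload)
        self.seen.add(event_id)
```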
8) Scalability
- Partitioning/sharding strategy and state distribution (a stable hash-partitioning sketch follows this list).
- Vertical vs. horizontal scaling, and hot-key mitigation.
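A stable hash-partitioning sketch (illustrative only); one common hot-key mitigation, assumed here rather than prescribed, is to serve known hot keys from a small replicated local cache on every worker instead of routing them all to one partition:

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash partitioning: the same canonical key always lands on the
    same processor, so each shard only needs its slice of list2 state."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```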
9) SLOs and Cost Trade-offs
- Target throughput/latency (e.g., p95/p99) and how you would size the hardware (a back-of-the-envelope sizing sketch follows).
- Memory/compute/network trade-offs for in-memory sets, Bloom filters, Redis, and databases.
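A back-of-the-envelope sizing helper of the kind expected here; the batch latency and throughput figures are illustrative assumptions, not measurements:

```python
import math

def workers_needed(events_per_sec: float, batch_size: int,
                   batch_latency_ms: float, headroom: float = 2.0) -> int:
    """How many lookup workers a throughput target needs, given per-batch latency."""
    batches_per_sec_per_worker = 1000.0 / batch_latency_ms
    events_per_sec_per_worker = batches_per_sec_per_worker * batch_size
    return math.ceil(headroom * events_per_sec / events_per_sec_per_worker)

# Example: 200k events/s, 500-key batches, ~2 ms per pipelined Redis batch
# -> one worker handles ~250k events/s; with 2x headroom, 2 workers.
print(workers_needed(200_000, 500, 2.0))
```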
Your answer should include a step-by-step data flow, small numeric examples (e.g., Bloom filter sizing), key formulas, and guardrails to validate correctness and performance.