Design a Near Real-Time Anti-Join: Stream vs. Large Reference List
You are given:
- list1: an unbounded, high-throughput event stream of keys (e.g., user IDs, IPs, URLs).
- list2: a very large, frequently updated reference set of keys that may not fit in memory on a single machine.
Goal: Emit, with near real-time latency, the events from list1 whose keys are NOT present in list2.
Provide a detailed system design that includes:
1) Architecture
- Ingestion: how events and list2 changes enter the system.
- Processing: how the anti-join is performed at scale (a minimal processing-loop sketch follows this list).
- Storage: which stores act as the source of truth and which serve as caches.
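For illustration, here is a minimal Python sketch of the kind of processing loop expected, assuming events arrive as plain string keys; `contains_many` stands in for the batched membership check of section 2 and `canonicalize` for the normalization rules of section 4 (both are placeholders, not prescribed implementations):

```python
from typing import Callable, Iterable, Iterator, Sequence

def anti_join(events: Iterable[str],
              contains_many: Callable[[Sequence[str]], Sequence[bool]],
              canonicalize: Callable[[str], str],
              batch_size: int = 500) -> Iterator[str]:
    """Yield events whose canonical key is NOT present in list2."""
    batch: list[str] = []
    for event in events:
        batch.append(event)
        if len(batch) >= batch_size:
            yield from _flush(batch, contains_many, canonicalize)
            batch = []
    if batch:
        yield from _flush(batch, contains_many, canonicalize)

def _flush(batch, contains_many, canonicalize):
    keys = [canonicalize(e) for e in batch]          # shared normalization (section 4)
    for event, present in zip(batch, contains_many(keys)):
        if not present:                              # anti-join: emit only keys absent from list2
            yield event
```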
2) Membership Checking Strategies
Describe how you would perform membership checks for list2 using:
- (a) An in-memory set (when feasible, and how to shard it).
- (b) A distributed cache (e.g., Redis) with pipelining/batching.
- (c) An external database/NoSQL store as the authoritative source.
Explain how you would combine these layers (e.g., L1/L2/L3 lookups) to balance latency and cost; a layered-lookup sketch follows.
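For example, the layered check might be sketched like this (assuming redis-py, a Redis SET named "list2:keys", and a hypothetical `db_lookup_many` batched query against the authoritative store):

```python
from typing import Sequence
import redis

class LayeredMembership:
    """L1: in-process set of hot keys; L2: Redis set (pipelined); L3: authoritative DB."""

    def __init__(self, redis_client: redis.Redis, hot_keys: set[str], db_lookup_many):
        self.l1 = hot_keys                      # small, sharded per worker
        self.redis = redis_client
        self.db_lookup_many = db_lookup_many    # hypothetical batched DB query

    def contains_many(self, keys: Sequence[str]) -> list[bool]:
        result = [False] * len(keys)
        misses = []
        for i, k in enumerate(keys):
            if k in self.l1:                    # L1: no network hop
                result[i] = True
            else:
                misses.append(i)
        if misses:
            pipe = self.redis.pipeline(transaction=False)
            for i in misses:
                pipe.sismember("list2:keys", keys[i])   # L2: one round trip per batch
            l2_hits = pipe.execute()
            still_missing = []
            for i, hit in zip(misses, l2_hits):
                if hit:
                    result[i] = True
                else:
                    still_missing.append(i)
            if still_missing:                   # L3: only keys the cache cannot vouch for
                db_hits = self.db_lookup_many([keys[i] for i in still_missing])
                for i, hit in zip(still_missing, db_hits):
                    result[i] = bool(hit)
        return result
```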
3) Bloom Filter Usage
- When and why to add a Bloom filter in front of the cache/DB.
- How to size it (false-positive rate, memory, number of hash functions) and where to place it (per-partition or global).
- How to mitigate false positives (exact fallback lookups, counting filters vs. periodic rebuilds) and how to handle deletes. A worked sizing example follows this list.
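One standard sizing rule: for n keys and a target false-positive rate p, a Bloom filter needs m = -n * ln(p) / (ln 2)^2 bits and k = (m / n) * ln 2 hash functions. A small calculator with a numeric example (1 billion keys at a 1% false-positive target comes to roughly 1.1 GiB and 7 hash functions):

```python
import math

def bloom_sizing(n_keys: int, fp_rate: float) -> tuple[int, int]:
    """Return (bits, hash_count) for a standard Bloom filter.

    m = -n * ln(p) / (ln 2)^2,  k = (m / n) * ln 2
    """
    m = math.ceil(-n_keys * math.log(fp_rate) / (math.log(2) ** 2))
    k = max(1, round((m / n_keys) * math.log(2)))
    return m, k

# Example: 1e9 keys at a 1% false-positive target
bits, hashes = bloom_sizing(1_000_000_000, 0.01)
print(bits / 8 / 1024**3, hashes)   # ~1.12 GiB of bits, k = 7
```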
4) Normalization Consistency
- The canonicalization rules (e.g., case folding, trimming, Unicode normalization, formatting, hashing).
- How to enforce consistent normalization across producers, consumers, and stores (a shared canonicalization routine is sketched below).
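For example, a single shared, versioned canonicalization routine reused by every producer and consumer; the specific rules below (NFKC, casefold, trim) are illustrative assumptions:

```python
import hashlib
import unicodedata

def canonicalize(raw: str) -> str:
    """Apply the same rules when list2 is built and when list1 is checked."""
    key = unicodedata.normalize("NFKC", raw)   # Unicode normalization
    key = key.strip().casefold()               # trim + aggressive case folding
    return key

def key_digest(raw: str) -> str:
    """Optional: store/compare fixed-width digests instead of raw keys."""
    return hashlib.sha256(canonicalize(raw).encode("utf-8")).hexdigest()
```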
5) Updates to list2
- How list2 is maintained (snapshot + CDC, periodic rebuilds, or streaming updates).
- Ordering/consistency between list1 events and list2 updates to avoid correctness gaps.
- Handling adds and deletes, cache invalidation, and Bloom filter maintenance (a CDC-apply sketch follows).
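As an illustration, applying one CDC record to the cache layer might look like the sketch below (same Redis SET as above; `counting_bloom` is a hypothetical counting Bloom filter with add()/remove(), since a plain Bloom filter cannot delete and would otherwise need periodic rebuilds):

```python
import redis

def apply_list2_change(r: redis.Redis, op: str, key: str, counting_bloom=None) -> None:
    """Apply one CDC record ({"op": "add"|"delete", "key": ...}) to the cache layer."""
    if op == "add":
        r.sadd("list2:keys", key)
        if counting_bloom is not None:
            counting_bloom.add(key)
    elif op == "delete":
        r.srem("list2:keys", key)
        if counting_bloom is not None:
            counting_bloom.remove(key)
    else:
        raise ValueError(f"unknown CDC op: {op}")
```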
6) Back-Pressure and Flow Control
- How the system reacts when downstream is slow (e.g., bounded queues, asynchronous I/O, scaling out); see the bounded-queue sketch below.
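A minimal back-pressure sketch, assuming a single in-process lookup worker fed through a bounded queue; when the queue is full, the caller is expected to pause or rewind the upstream consumer rather than buffer without limit:

```python
import queue

events_q: "queue.Queue[str]" = queue.Queue(maxsize=10_000)   # bounded buffer

def ingest(event: str) -> bool:
    try:
        events_q.put(event, timeout=0.5)   # blocks briefly while the queue is full
        return True
    except queue.Full:
        return False                       # signal: pause/rewind the upstream consumer

def lookup_worker(contains_many, emit, batch_size: int = 500) -> None:
    while True:
        batch = [events_q.get()]           # block until at least one event is available
        while len(batch) < batch_size and not events_q.empty():
            batch.append(events_q.get_nowait())
        for event, present in zip(batch, contains_many(batch)):
            if not present:                # anti-join: emit only non-members
                emit(event)
```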
7) Fault Tolerance and Correctness
- Delivery semantics (at-least-once, at-most-once, or exactly-once) and the idempotency strategy (an idempotent-sink sketch follows).
- Recovery from processor, cache, and database failures.
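For instance, with at-least-once delivery a replay-safe sink can suppress duplicates by a stable event id; this is only a sketch, and a real system would bound the seen-set with a TTL or rely on an idempotent upsert in the sink itself:

```python
from typing import Callable

class IdempotentSink:
    """Drop duplicate emits caused by replays after a failure."""

    def __init__(self, emit_fn: Callable[[dict], None]):
        self.emit_fn = emit_fn
        self.seen: set[str] = set()     # in practice: TTL cache or sink-side dedup

    def emit(self, event_id: str, payload: dict) -> None:
        if event_id in self.seen:
            return                      # duplicate from a replay; drop it
        self.emit_fn(payload)
        self.seen.add(event_id)
```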
8) Scalability
- Partitioning/sharding strategy and state distribution (a stable hash-partitioning sketch follows this list).
- Vertical vs. horizontal scaling, and hot-key mitigation.
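A stable hash-partitioning sketch (illustrative only); one common hot-key mitigation, assumed here rather than prescribed, is to serve known hot keys from a small replicated local cache on every worker instead of routing them all to one partition:

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash partitioning: the same canonical key always lands on the
    same processor, so each shard only needs its slice of list2 state."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```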
9) SLOs and Cost Trade-offs
- Target throughput/latency (e.g., p95/p99) and how you would size the hardware (a back-of-the-envelope sizing sketch follows).
- Memory/compute/network trade-offs for in-memory sets, Bloom filters, Redis, and databases.
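A back-of-the-envelope sizing helper of the kind expected here; the batch latency and throughput figures are illustrative assumptions, not measurements:

```python
import math

def workers_needed(events_per_sec: float, batch_size: int,
                   batch_latency_ms: float, headroom: float = 2.0) -> int:
    """How many lookup workers a throughput target needs, given per-batch latency."""
    batches_per_sec_per_worker = 1000.0 / batch_latency_ms
    events_per_sec_per_worker = batches_per_sec_per_worker * batch_size
    return math.ceil(headroom * events_per_sec / events_per_sec_per_worker)

# Example: 200k events/s, 500-key batches, ~2 ms per pipelined Redis batch
# -> one worker handles ~250k events/s; with 2x headroom, 2 workers.
print(workers_needed(200_000, 500, 2.0))
```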
Your answer should include a step-by-step data flow, small numeric examples (e.g., Bloom filter sizing), key formulas, and guardrails to validate correctness and performance.