PracHub

Design streaming difference at scale

Last updated: May 3, 2026

Quick Overview

This question evaluates system design and streaming-data engineering skills: membership-check strategies, caching and storage tiers, probabilistic data structures (Bloom filters), normalization, update propagation, fault tolerance, and scalability. It belongs to the System Design domain and tests both the practical application of distributed architecture patterns and a conceptual understanding of their trade-offs. It is commonly asked to assess an engineer's ability to architect a near-real-time anti-join between a high-throughput event stream and a very large reference set, reasoning about latency, throughput, correctness, state management, and cost-performance trade-offs at the design level rather than in implementation detail.

Company: Palo Alto Networks

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Onsite

Suppose list1 is an unbounded stream and list2 may be too large to fit in memory. Design a system that outputs, in near real time, the elements from the stream that are not present in list2. Describe your architecture (ingestion, processing, storage), how to perform membership checks for list2 using an in-memory set, a cache like Redis, or an external database, how and when to use a Bloom filter and mitigate false positives, how to enforce consistent normalization, and how you will address updates to list2, back-pressure, fault tolerance, scalability, throughput/latency targets, and cost trade-offs.

Sep 6, 2025, 12:00 AM

Design a Near Real-Time Anti-Join: Stream vs. Large Reference List

You are given:

  • list1: an unbounded, high-throughput event stream of keys (e.g., user IDs, IPs, URLs).
  • list2: a very large, frequently updated reference set of keys that may not fit in memory on a single machine.

Goal: Emit, with near real-time latency, the events from list1 whose keys are NOT present in list2.

Provide a detailed system design that includes:

1) Architecture

  • Ingestion: how events and list2 changes enter the system.
  • Processing: how the anti-join is performed at scale.
  • Storage: what stores act as source of truth and caches.

2) Membership Checking Strategies

Describe how you would perform membership checks for list2 using:

  • (a) An in-memory set (when feasible, and how to shard).
  • (b) A distributed cache (e.g., Redis) with pipelining/batching.
  • (c) An external database/NoSQL store as the authoritative source.

Explain how you would combine these layers (e.g., L1/L2/L3 lookups) for latency and cost.
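
One way to picture the layered lookup is a read-through chain: a small per-process LRU (L1), a shared cache such as Redis (L2), and the authoritative store (L3). The sketch below is a minimal illustration under stated assumptions, not a reference design. `TieredMembership`, `l2_get`, `l2_put`, and `l3_contains` are hypothetical names, and plain Python containers stand in for Redis and the database so the example runs on its own.

```python
from collections import OrderedDict
from typing import Callable, Dict, Optional


class TieredMembership:
    """Read-through membership check: L1 local LRU, then L2 cache, then L3 store."""

    def __init__(
        self,
        l2_get: Callable[[str], Optional[bool]],   # hypothetical: returns None on a miss
        l2_put: Callable[[str, bool], None],       # hypothetical: stores the answer
        l3_contains: Callable[[str], bool],        # hypothetical: authoritative lookup
        l1_capacity: int = 10_000,
    ) -> None:
        self._l1: "OrderedDict[str, bool]" = OrderedDict()
        self._l1_capacity = l1_capacity
        self._l2_get = l2_get
        self._l2_put = l2_put
        self._l3_contains = l3_contains
        self.hits: Dict[str, int] = {"l1": 0, "l2": 0, "l3": 0}

    def _remember(self, key: str, present: bool) -> None:
        # Insert into the local LRU and evict the least recently used entry if full.
        self._l1[key] = present
        self._l1.move_to_end(key)
        if len(self._l1) > self._l1_capacity:
            self._l1.popitem(last=False)

    def contains(self, key: str) -> bool:
        if key in self._l1:
            self.hits["l1"] += 1
            self._l1.move_to_end(key)
            return self._l1[key]
        cached = self._l2_get(key)
        if cached is not None:
            self.hits["l2"] += 1
            self._remember(key, cached)
            return cached
        present = self._l3_contains(key)
        self.hits["l3"] += 1
        self._l2_put(key, present)
        self._remember(key, present)
        return present


# Stand-ins: a set plays the database, a dict plays the shared cache.
reference = {"a", "b"}
shared_cache: Dict[str, bool] = {}
tiers = TieredMembership(shared_cache.get, shared_cache.__setitem__, reference.__contains__)

# Anti-join: keep only stream keys that are NOT in the reference set.
emitted = [key for key in ["a", "x", "x", "b"] if not tiers.contains(key)]
```

Note that this sketch caches negative answers ("not present") as well as positive ones. That keeps the hot path cheap, but a cached negative goes stale the moment list2 gains that key, so a real design would pair it with a TTL or with invalidation driven by list2 updates.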

3) Bloom Filter Usage

  • When and why to add a Bloom filter in front of the cache/DB.
  • How to size it (false-positive rate, memory, number of hashes) and place it (per-partition/global).
  • How to mitigate false positives (exact fallback, counters vs. rebuilds) and handle deletes.
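
For sizing, the standard Bloom filter formulas are m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, where n is the number of keys and p the target false-positive rate. As a rough worked example (the numbers are assumptions chosen for illustration), n = 10^9 keys at p = 0.01 needs about 9.6 bits per key, roughly 1.2 GB in total, with k = 7. In an anti-join the filter's "definitely absent" answer lets an event be emitted without touching the cache or database, while "maybe present" must fall back to an exact lookup, so false positives cost extra lookups rather than wrong output. The sketch below is a minimal filter using double hashing over SHA-256; `bloom_parameters` and `BloomFilter` are hypothetical names, and a plain Bloom filter like this one does not support deletes.

```python
import hashlib
import math
from typing import List, Tuple


def bloom_parameters(n: int, p: float) -> Tuple[int, int]:
    """Return (m_bits, k_hashes) for n keys and target false-positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k


class BloomFilter:
    def __init__(self, n: int, p: float) -> None:
        self.m, self.k = bloom_parameters(n, p)
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key: str) -> List[int]:
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means definitely absent; True means "maybe", so verify exactly.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


big_m, big_k = bloom_parameters(1_000_000_000, 0.01)

small = BloomFilter(n=1000, p=0.01)
for i in range(1000):
    small.add(f"member-{i}")
false_positives = sum(small.might_contain(f"other-{i}") for i in range(10_000))
```

Because a plain filter cannot remove keys, deletes from list2 are usually handled either by a counting variant (more memory per slot) or by periodic rebuilds from the authoritative store; until then a deleted key only produces an unnecessary exact lookup, not an incorrect result.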

4) Normalization Consistency

  • The canonicalization rules (e.g., case folding, trimming, Unicode normalization, formatting, hashing).
  • How to enforce consistent normalization across producers/consumers and stores.
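
A common way to enforce this is to ship one canonicalization function as a shared library and apply it on both the list1 and list2 write paths before any hashing or storage. The following is a sketch only: the specific rules (trim, Unicode NFKC, case folding) and the names `canonicalize`, `key_fingerprint`, and `NORMALIZATION_VERSION` are assumptions for illustration, and real rules depend on the key type (URLs and IP addresses need their own handling).

```python
import hashlib
import unicodedata

# Hypothetical version tag: bump it whenever the rules change, so that caches
# and filters built under old rules can be detected and rebuilt.
NORMALIZATION_VERSION = "v1"


def canonicalize(raw: str) -> str:
    """Trim, apply Unicode NFKC, and case-fold a key."""
    text = unicodedata.normalize("NFKC", raw).strip()
    # Case folding can produce text that is no longer normalized, so normalize again.
    return unicodedata.normalize("NFKC", text.casefold())


def key_fingerprint(raw: str) -> str:
    """Stable 64-bit hex fingerprint of the canonical form, tagged with the rule version."""
    payload = f"{NORMALIZATION_VERSION}:{canonicalize(raw)}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


same = key_fingerprint("  Example.COM ") == key_fingerprint("example.com")
```

Including the version in the fingerprint is one design choice among several. It means a rule change invalidates every derived structure at once, which is safe but forces a full rebuild.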

5) Updates to list2

  • How list2 is maintained (snapshot + CDC, periodic rebuilds, or streaming updates).
  • Ordering/consistency between list1 events and list2 updates to avoid correctness gaps.
  • Handling adds and deletes, cache invalidation, and Bloom filter maintenance.
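
The snapshot-plus-CDC option can be sketched as a state object that loads a snapshot taken at a known sequence number and then applies change records idempotently. This is a minimal illustration that assumes changes for a given partition arrive in sequence order from a single ordered log; `ReferenceState`, the `(seq, op, key)` record shape, and the operation names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


class ReferenceState:
    """list2 state for one partition: a snapshot plus ordered change records."""

    def __init__(self, snapshot: Iterable[str], snapshot_seq: int) -> None:
        self.members: Set[str] = set(snapshot)
        self.applied_seq = snapshot_seq  # highest change sequence already reflected

    def apply(self, seq: int, op: str, key: str) -> bool:
        """Apply one change; return False if it was already applied (a replay)."""
        if seq <= self.applied_seq:
            return False
        if op == "add":
            self.members.add(key)
        elif op == "delete":
            self.members.discard(key)
        else:
            raise ValueError(f"unknown operation: {op}")
        self.applied_seq = seq
        return True


state = ReferenceState(snapshot={"a", "b"}, snapshot_seq=100)
changes: List[Tuple[int, str, str]] = [
    (99, "delete", "a"),   # older than the snapshot, ignored
    (101, "add", "c"),
    (102, "delete", "b"),
    (101, "add", "c"),     # redelivered after a retry, ignored
]
applied = [state.apply(seq, op, key) for seq, op, key in changes]
```

Tagging each emitted event with the `applied_seq` that was in effect at decision time is one way to make the ordering between list1 events and list2 updates auditable, since it records which version of list2 each decision was made against.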

6) Back-Pressure and Flow Control

  • How the system reacts when downstream is slow (e.g., bounded queues, asynchronous I/O, scaling).
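
A bounded queue is the simplest form of back-pressure: when the lookup stage falls behind, the queue fills and the producer is suspended instead of buffering without limit. The sketch below shows this with `asyncio.Queue`; `run_pipeline` and `max_in_flight` are hypothetical names, and `asyncio.sleep(0)` stands in for an asynchronous cache or database call.

```python
import asyncio
from typing import List, Set, Tuple


async def run_pipeline(
    events: List[str], reference: Set[str], max_in_flight: int = 4
) -> Tuple[List[str], int]:
    queue: "asyncio.Queue" = asyncio.Queue(maxsize=max_in_flight)
    emitted: List[str] = []
    peak_depth = 0

    async def producer() -> None:
        nonlocal peak_depth
        for event in events:
            # put() suspends here whenever the queue is full, which is the
            # back-pressure signal propagating upstream.
            await queue.put(event)
            peak_depth = max(peak_depth, queue.qsize())
        await queue.put(None)  # sentinel: no more events

    async def consumer() -> None:
        while True:
            event = await queue.get()
            if event is None:
                break
            await asyncio.sleep(0)  # stand-in for an async membership lookup
            if event not in reference:
                emitted.append(event)

    await asyncio.gather(producer(), consumer())
    return emitted, peak_depth


stream = [f"k{i}" for i in range(10)]
emitted, peak_depth = asyncio.run(run_pipeline(stream, {"k0", "k2", "k4", "k6", "k8"}))
```

In a deployed system the same idea usually appears as consumer lag on the ingestion log plus a cap on in-flight lookups, with autoscaling triggered when lag grows.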

7) Fault Tolerance and Correctness

  • Delivery semantics (at-least-once, at-most-once, or exactly-once) and idempotency strategy.
  • Recovery for processor, cache, and database failures.
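
One common idempotency strategy is to accept at-least-once delivery and make the sink deduplicate on a stable event identifier, so that replays after a crash do not produce duplicate output. The sketch below assumes each event carries a `(partition, offset)` pair from the ingestion log; `IdempotentSink` is a hypothetical name, and an in-memory dict stands in for a durable keyed store.

```python
from typing import Dict, List, Tuple

EventId = Tuple[int, int]  # assumed shape: (partition, offset) from the ingestion log


class IdempotentSink:
    """Sink that ignores events it has already written."""

    def __init__(self) -> None:
        self.records: Dict[EventId, str] = {}

    def write(self, event_id: EventId, key: str) -> bool:
        """Return True if the record was written, False if it was a duplicate."""
        if event_id in self.records:
            return False
        self.records[event_id] = key
        return True


sink = IdempotentSink()
first_attempt: List[Tuple[EventId, str]] = [((0, 10), "x"), ((0, 11), "y")]
# After a crash the processor restarts from its last checkpoint (offset 10)
# and reprocesses events it had already emitted.
replay: List[Tuple[EventId, str]] = [((0, 10), "x"), ((0, 11), "y"), ((0, 12), "z")]

results = [sink.write(eid, key) for eid, key in first_attempt + replay]
```

The dedup state itself has to be bounded in practice, for example by keeping only identifiers newer than the last committed checkpoint per partition.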

8) Scalability

  • Partitioning/sharding strategy and state distribution.
  • Vertical vs. horizontal scaling, hot-key mitigation.
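
If list1 events and list2 entries are partitioned by the same function of the canonical key, each processor only needs the slice of list2 that its own events can match. The sketch below illustrates this co-partitioning with a stable hash; `shard_for` and the shard count are assumptions for illustration. Python's built-in `hash()` is deliberately avoided because it is randomized per process for strings.

```python
import hashlib
from collections import Counter


def shard_for(key: str, num_shards: int) -> int:
    """Map a canonical key to a shard using a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


NUM_SHARDS = 8  # assumed value for the example

# The same function routes both stream events and reference-set entries.
event_shard = shard_for("user-42", NUM_SHARDS)
reference_shard = shard_for("user-42", NUM_SHARDS)

# Rough balance check over synthetic keys.
load = Counter(shard_for(f"user-{i}", NUM_SHARDS) for i in range(10_000))
```

Simple modulo sharding reshuffles most keys when the shard count changes, so designs that expect frequent resizing often prefer consistent hashing or a fixed number of virtual partitions mapped onto workers. Hashing also does not help with a single hot key; a local cache in front of the lookup is the usual mitigation for that case.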

9) SLOs and Cost Trade-offs

  • Target throughput/latency (e.g., p95/p99) and how you would size hardware.
  • Memory/compute/network trade-offs for in-memory sets, Bloom filters, Redis, and databases.
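
A back-of-envelope calculation helps connect these trade-offs to hardware sizing. The sketch below estimates how many exact lookups per second reach the cache or database once a Bloom filter and a local cache sit in front of it. Every input is an assumed figure for illustration, and the function names are hypothetical; it also assumes the local-cache hit rate applies uniformly to lookups that pass the filter.

```python
def exact_lookup_rate(
    events_per_second: float,
    present_fraction: float,   # share of stream keys that really are in list2
    bloom_fp_rate: float,      # Bloom filter false-positive rate p
    local_hit_rate: float,     # share of "maybe present" lookups served by the L1 cache
) -> float:
    """Lookups per second that must go past the local tier."""
    # Keys in list2 always pass the filter; absent keys pass with probability p.
    maybe_fraction = present_fraction + (1.0 - present_fraction) * bloom_fp_rate
    return events_per_second * maybe_fraction * (1.0 - local_hit_rate)


def raw_key_memory_gb(num_keys: int, bytes_per_key: int) -> float:
    """Raw payload size in GB (10^9 bytes), before any container overhead."""
    return num_keys * bytes_per_key / 1e9


# Assumed workload: 1M events/s, 30% of keys present, p = 0.01, 80% local hit rate.
remote_rate = exact_lookup_rate(1_000_000, 0.30, 0.01, 0.80)

# Assumed reference set: 10^9 keys stored as 16-byte fingerprints.
set_gb = raw_key_memory_gb(1_000_000_000, 16)
```

Under these assumptions about 61,400 lookups per second reach the remote tier, and an exact in-memory set needs at least 16 GB of raw key data across the fleet, compared with roughly 1.2 GB for a Bloom filter at p = 0.01 over the same keys.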

Your answer should include a step-by-step data flow, small numeric examples (e.g., Bloom filter sizing), key formulas, and guardrails to validate correctness and performance.
