Design IP/CIDR rule matcher
Company: Databricks
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
Quick Answer: This question evaluates understanding of IPv4 addressing, CIDR and numeric range representations, plus competency in designing efficient data structures and algorithms for low-latency rule matching with dynamic updates.
Part 1: IPv4 Rule Matcher with Add, Remove, and Query
Constraints
- 0 <= len(operations) <= 100000
- All IPv4 addresses are valid dotted-decimal IPv4 strings
- CIDR prefix lengths are between 0 and 32 inclusive
- Actions are either 'accept' or 'deny'
- For range rules, start and end fit in the IPv4 space; treat the rule as inclusive
Examples
Input: [('add_cidr', '10.0.0.0/8', 'accept'), ('add_cidr', '10.1.0.0/16', 'deny'), ('query', '10.1.2.3'), ('query', '10.2.3.4'), ('query', '11.0.0.1')]
Expected Output: ['deny', 'accept', 'none']
Explanation: The /16 rule is more specific than the /8 rule for 10.1.2.3. The address 10.2.3.4 matches only the /8. The address 11.0.0.1 matches nothing.
Input: [('add_cidr', '192.168.1.1/32', 'deny'), ('add_range', '192.168.1.1', '192.168.1.1', 'accept'), ('query', '192.168.1.1')]
Expected Output: ['accept']
Explanation: Both rules cover exactly one IP, so they have equal specificity. The newer rule wins.
Input: [('add_range', '1.1.1.0', '1.1.1.255', 'deny'), ('add_range', '1.1.1.0', '1.1.1.255', 'accept'), ('query', '1.1.1.1'), ('remove_range', '1.1.1.0', '1.1.1.255', 'accept'), ('query', '1.1.1.1'), ('remove_range', '1.1.1.0', '1.1.1.255', 'deny'), ('query', '1.1.1.1')]
Expected Output: ['accept', 'deny', 'none']
Explanation: Two identical ranges are active, so newest wins first. After removing the newer one, the older deny remains. After removing that too, no rule matches.
Input: [('remove_cidr', '0.0.0.0/0', 'deny'), ('query', '8.8.8.8'), ('add_cidr', '0.0.0.0/0', 'deny'), ('add_cidr', '255.255.255.255/32', 'accept'), ('query', '255.255.255.255'), ('query', '0.0.0.0')]
Expected Output: ['none', 'accept', 'deny']
Explanation: Removing a non-existent rule does nothing. The /32 for 255.255.255.255 is more specific than /0, while 0.0.0.0 matches only the /0 rule.
Hints
- Convert every IPv4 address to a 32-bit integer before comparing ranges. Use explicit parentheses when shifting and masking.
- Because the full operation list is known in advance, compress only the IPs that are actually queried, then support range updates and point queries on those coordinates.
Part 2: Choose the Best Rule-Index Data Structure for Firewall Workloads
Constraints
- 1 <= len(scenarios) <= 100000
- All count fields are integers between 0 and 10^9
- coordinate_count is an integer between 0 and 10^9
- Use L(x) = 0 if x <= 1, otherwise ceil(log2(x))
- If multiple strategies have the same cost, use priority: sorted_disjoint_intervals < ordered_map < interval_tree < segment_tree < radix_trie < hybrid(...)
Examples
Input: [{'prefix_rules': 0, 'range_rules': 8, 'prefix_updates': 0, 'range_updates': 0, 'queries': 100, 'disjoint_ranges': True, 'coordinate_count': 0}]
Expected Output: ['sorted_disjoint_intervals']
Explanation: This is a pure static disjoint-range workload, which is exactly what sorted disjoint intervals are best at under the given model.
Input: [{'prefix_rules': 10, 'range_rules': 0, 'prefix_updates': 5, 'range_updates': 0, 'queries': 20, 'disjoint_ranges': False, 'coordinate_count': 0}]
Expected Output: ['radix_trie']
Explanation: This is a pure prefix workload, and only the radix trie directly targets prefixes without converting them to ranges.
Input: [{'prefix_rules': 50, 'range_rules': 10, 'prefix_updates': 10, 'range_updates': 5, 'queries': 100, 'disjoint_ranges': True, 'coordinate_count': 50000}]
Expected Output: ['hybrid(ordered_map+radix_trie)']
Explanation: For the mixed workload, the trie handles prefixes and the ordered map is the cheapest valid range structure, so the hybrid beats a full segment tree.
Input: [{'prefix_rules': 5, 'range_rules': 50, 'prefix_updates': 1, 'range_updates': 20, 'queries': 30, 'disjoint_ranges': False, 'coordinate_count': 64}]
Expected Output: ['segment_tree']
Explanation: The coordinate count is small enough that a segment tree over compressed coordinates becomes cheaper than the hybrid option.
Input: [{'prefix_rules': 0, 'range_rules': 0, 'prefix_updates': 0, 'range_updates': 0, 'queries': 0, 'disjoint_ranges': True, 'coordinate_count': 0}]
Expected Output: ['sorted_disjoint_intervals']
Explanation: All valid strategies have zero cost on an empty workload, so the required tie-break picks sorted_disjoint_intervals.
Hints
- First classify each scenario as pure range, pure prefix, mixed, or empty. Different strategy sets are valid in different categories.
- Avoid floating-point logs. In Python, ceil(log2(x)) for x > 1 can be computed as (x - 1).bit_length().