Implement Prefix Match Filter
Company: Datadog
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: hard
Interview Round: Onsite
Quick Answer: This question evaluates understanding of prefix-based string matching, data structure design (hash-based sets and trie/prefix tree), and algorithmic competencies including efficient in-memory storage and deduplication of matches.
Part 1: Hash-Based Prefix Match Filter
Constraints
- 0 <= len(prefixes), len(strings) <= 2000
- Each string length is between 0 and 200
- Strings contain lowercase English letters and may be empty
- A matching string should appear only once in the output even if it appears multiple times in the input
Examples
Input: (['ab', 'cat', 'pre'], ['abc', 'dog', 'prefix', 'cater', 'zoo'])
Expected Output: ['abc', 'prefix', 'cater']
Explanation: 'abc' starts with 'ab', 'prefix' starts with 'pre', and 'cater' starts with 'cat'.
Input: (['a', 'ab'], ['ab', 'abacus', 'ab', 'b'])
Expected Output: ['ab', 'abacus']
Explanation: Both 'ab' and 'abacus' match. The second 'ab' is omitted because output strings must be unique.
Input: ([''], ['', 'x', 'abc', 'x'])
Expected Output: ['', 'x', 'abc']
Explanation: The empty prefix matches every string, and duplicates in the input are returned once.
Input: ([], ['a', ''])
Expected Output: []
Explanation: With no stored prefixes, no string can match.
Hints
- A string matches if any prefix of that string is present in a hash set of stored prefixes.
- Handle the empty stored prefix carefully: it matches every string, including the empty string.
Part 2: Trie-Based Prefix Match Filter
Constraints
- 0 <= len(prefixes), len(strings) <= 10000
- The total number of characters across all prefixes and strings is at most 2 * 10^5
- Strings contain lowercase English letters and may be empty
- A matching string should appear only once in the output
Examples
Input: (['ab', 'cat', 'pre'], ['abc', 'dog', 'prefix', 'cater', 'zoo'])
Expected Output: ['abc', 'prefix', 'cater']
Explanation: The trie allows early stopping as soon as a terminal prefix node is reached.
Input: (['a', 'ab', 'abc'], ['abc', 'abd', 'b', 'abc'])
Expected Output: ['abc', 'abd']
Explanation: Both 'abc' and 'abd' match because they start with stored prefix 'a'. The repeated 'abc' is returned once.
Input: ([''], ['', 'hello', 'hello'])
Expected Output: ['', 'hello']
Explanation: The empty prefix matches every string.
Input: (['car', 'dog'], [])
Expected Output: []
Explanation: An empty input list produces an empty output list.
Input: (['x'], ['', 'x', 'xy'])
Expected Output: ['x', 'xy']
Explanation: The empty string does not match unless the empty prefix is stored.
Hints
- Each trie node should track whether a stored prefix ends at that node.
- The root itself can represent the empty prefix.
Part 3: Compare Hash-Based and Trie-Based Prefix Filter Costs
Constraints
- 0 <= len(prefixes), len(strings) <= 10000
- The total number of characters across all prefixes and strings is at most 2 * 10^5
- Strings contain lowercase English letters and may be empty
- Duplicate prefixes do not increase storage cost because storage is measured on distinct stored prefixes
Examples
Input: (['ab', 'cat', 'pre'], ['abc', 'dog', 'prefix', 'cater', 'zoo'])
Expected Output: {'hash_time': 27, 'trie_time': 10, 'hash_space': 8, 'trie_space': 8, 'faster_time': 'trie', 'smaller_space': 'tie'}
Explanation: The trie stops after 2 or 3 inspected characters on matching words and after 1 inspected character on obvious mismatches, while the hash method pays for all generated prefixes.
Input: (['a', 'ab', 'abc', 'ab'], ['abc', 'abd', 'b', ''])
Expected Output: {'hash_time': 3, 'trie_time': 3, 'hash_space': 6, 'trie_space': 3, 'faster_time': 'tie', 'smaller_space': 'trie'}
Explanation: Shortest stored prefix 'a' matches both 'abc' and 'abd' immediately. The trie saves space by sharing nodes.
Input: (['', 'cat'], ['', 'dog', 'cat'])
Expected Output: {'hash_time': 0, 'trie_time': 0, 'hash_space': 3, 'trie_space': 3, 'faster_time': 'tie', 'smaller_space': 'tie'}
Explanation: The empty prefix makes every query an immediate match in both approaches.
Input: ([], ['a', 'bb'])
Expected Output: {'hash_time': 4, 'trie_time': 2, 'hash_space': 0, 'trie_space': 0, 'faster_time': 'trie', 'smaller_space': 'tie'}
Explanation: With no stored prefixes, the trie fails on the first character of each non-empty string, but the hash method still pays for every generated prefix.
Input: (['car', 'cat', 'can'], ['cap', 'dog', 'c'])
Expected Output: {'hash_time': 13, 'trie_time': 5, 'hash_space': 9, 'trie_space': 5, 'faster_time': 'trie', 'smaller_space': 'trie'}
Explanation: The trie shares the 'ca' path across all three prefixes and also stops early on mismatches.
Hints
- A trie can tell you the earliest matching prefix length for a string and also where the first mismatch happens.
- If the first hash-based match occurs at length k, the total work is the triangular number k * (k + 1) // 2.