Build a Fuzzy Keyword Index
Company: Workday
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Onsite
Quick Answer: This question evaluates string-processing, text-indexing, approximate string-matching, and data-structure design skills for supporting many repeated queries on the same document.
Part 1: Exact Keyword Index with Starting Offsets
Constraints
- `0 <= len(text) <= 200000`
- `0 <= len(queries) <= 10000`
- Each query consists only of lowercase letters `a-z` and may repeat.
- Words in `text` are lowercase letters `a-z`; every other character is a separator.
- The answer for each query must be sorted in ascending order.
Examples
Input: ('cat,dog cat!dog', ['cat', 'dog', 'cow'])
Expected Output: [[0, 8], [4, 12], []]
Explanation: The words are `cat` at 0 and 8, and `dog` at 4 and 12. `cow` does not appear.
Input: ('a..ab,a', ['a', 'ab', 'b'])
Expected Output: [[0, 6], [3], []]
Explanation: `b` is not a whole word; it only appears inside `ab`.
Input: ('', ['a', 'b'])
Expected Output: [[], []]
Explanation: Edge case: empty document, so every query returns no matches.
Input: ('one two', [])
Expected Output: []
Explanation: Edge case: no queries.
Hints
- Scan the text once and detect word boundaries instead of searching separately for every query.
- A hash map from word -> list of starting offsets makes repeated exact lookups fast.
Part 2: Fuzzy Keyword Index with One Insertion or Deletion
Constraints
- `0 <= len(text) <= 200000`
- `0 <= len(queries) <= 10000`
- Each query is a non-empty lowercase string using only `a-z`.
- Words in `text` are lowercase letters `a-z`; every other character is a separator.
- A valid fuzzy match allows only exact match, one insertion, or one deletion. Replacements/substitutions are not allowed.
Examples
Input: ('cat cats ct cut scat at', ['cat'])
Expected Output: [[0, 4, 9, 16, 21]]
Explanation: `cat` matches exactly at 0, `cats` and `scat` by deleting one character from the document word, and `ct` and `at` by deleting one character from the query. `cut` is not included because that would require a substitution.
Input: ('cat cats ct cut scat at', ['cuts', 'cut', 'ca'])
Expected Output: [[12], [9, 12], [0]]
Explanation: `cuts` matches `cut`; `cut` matches itself and `ct`; `ca` matches `cat` because the document word can delete one character to become `ca`.
Input: ('a aa aaa b', ['aa'])
Expected Output: [[0, 2, 5]]
Explanation: Matches include `a`, `aa`, and `aaa`. Repeated deletion signatures should not create duplicate results.
Input: ('', ['cat'])
Expected Output: [[]]
Explanation: Edge case: empty document.
Hints
- Keep the exact index from Part 1, but also consider signatures created by deleting one character from a word.
- A query can match longer document words via a precomputed deletion-signature map, and shorter document words by generating all one-character deletions of the query itself.