Build a Searchable Sentence Index
Company: Amazon
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: hard
Interview Round: Onsite
Quick Answer: This question evaluates a candidate's ability to design and implement in-memory text indexes and related data structures, covering inverted indexing, tokenization and case-insensitive word matching, update/deletion handling, and analysis of time and space complexity.
Constraints
- 1 <= len(operations) <= 200000
- The total number of characters across all added texts and search words is at most 10^6
- -2^31 <= id <= 2^31 - 1
- operations and arguments have the same length
Examples
Input: (['addSentence', 'addSentence', 'search', 'deleteSentence', 'search', 'search'], [(2, 'Hello, world! hello'), (1, 'World of code.'), ('world',), (2,), ('hello',), ('world',)])
Expected Output: [[1, 2], [], [1]]
Explanation: Matching is case-insensitive, 'hello' in sentence 2 is counted once, ids are returned sorted, and after deleting sentence 2 only sentence 1 still matches 'world'.
Input: (['addSentence', 'addSentence', 'search', 'search'], [(10, 'red-blue red'), (3, 'Blue? green; red!'), ('red',), ('blue',)])
Expected Output: [[3, 10], [3, 10]]
Explanation: Non-alphanumeric characters split words, so 'red-blue' contributes both 'red' and 'blue'. Sentence 10 still appears only once per matching word.
Input: (['addSentence', 'search', 'deleteSentence', 'search'], [(5, ''), ('anything',), (5,), ('anything',)])
Expected Output: [[], []]
Explanation: An empty sentence contributes no indexed words, so searches return empty results before and after deletion.
Input: (['addSentence', 'addSentence', 'search', 'search', 'deleteSentence', 'search'], [(7, 'Version2 is better than v1.'), (4, 'v1, V2, and version2'), ('version2',), ('v1',), (4,), ('v2',)])
Expected Output: [[4, 7], [4, 7], []]
Explanation: Alphanumeric tokens can contain digits. After deleting sentence 4, the word 'v2' is no longer present in any sentence.
Input: (['addSentence', 'addSentence', 'search', 'deleteSentence', 'search'], [(-1, 'Alpha beta'), (2, 'beta gamma'), ('BETA',), (-1,), ('alpha',)])
Expected Output: [[-1, 2], []]
Explanation: Ids may be negative, search is case-insensitive, and results must still be sorted in ascending numeric order.
Hints
- An inverted index maps each normalized word to the sentence ids that contain it.
- To delete efficiently, also store for each sentence id the set of distinct normalized words that were extracted from that sentence.