Implement TF-IDF scoring for documents
Company: Apple
Role: Machine Learning Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Technical Screen
Quick Answer: This question evaluates understanding of information retrieval and text representation concepts—specifically TF–IDF weighting, tokenization, term frequency and document frequency—and measures competency in implementing an efficient scoring and ranking algorithm.
Constraints
- 1 <= len(docs) <= 10^4
- The total number of tokens across all documents after tokenization is at most about 2 * 10^6
- Document and query text may contain letters, digits, spaces, and punctuation
- Tokenization must lowercase text, split on non-alphanumeric characters, and ignore empty tokens
Examples
Input: (['red apple', 'green apple', 'banana'], 'apple')
Expected Output: [0, 1, 2]
Explanation: Documents 0 and 1 each contain 'apple' once, so they tie with a positive score. Document 2 gets score 0, and ties are broken by smaller index.
Input: (['Cat cat dog!', 'dog; mouse', 'cat'], 'cat dog')
Expected Output: [0, 1, 2]
Explanation: After tokenization and lowercasing, document 0 has tf(cat)=2 and tf(dog)=1, so it scores highest. Documents 1 and 2 each match one query term once, so they tie and index 1 comes before 2.
Input: (['v1 release notes', 'release v2', 'patch'], 'security')
Expected Output: [0, 1, 2]
Explanation: No document contains 'security', so every document gets score 0. The final order is by index.
Input: (['apple apple banana', 'apple banana banana', 'banana'], 'apple apple')
Expected Output: [0, 1, 2]
Explanation: The query contains 'apple' twice, so its contribution is doubled. Document 0 has two apples and ranks above document 1, which has one.
Input: (['!!!'], 'anything')
Expected Output: [0]
Explanation: The only document tokenizes to an empty list, so its score is 0. There is only one valid index.
Input: (['rare common', 'common common', 'common'], 'rare common')
Expected Output: [0, 1, 2]
Explanation: The token 'rare' appears in fewer documents than 'common', so it gets a higher IDF. Document 0 contains both terms and ranks above document 1, even though document 1 contains 'common' twice.
Input: (['alpha', 'beta'], '!!!')
Expected Output: [0, 1]
Explanation: The query tokenizes to no tokens, so every score is 0 and the indices are returned in ascending order.
Hints
- You only need document frequencies for terms that actually appear in the query.
- To save memory, compute document frequencies in one pass, then compute scores in a second pass once IDF values are known.