Implement TF-IDF scoring for documents
Company: Apple
Role: Machine Learning Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Technical Screen
Quick Answer: This question evaluates understanding of information retrieval and text representation, specifically TF-IDF weighting, tokenization, term frequency, and document frequency, and measures the ability to implement an efficient scoring and ranking algorithm.
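Concretely, the scoring model the reference solution below implements is a smoothed TF-IDF sum:

score(d) = sum over query terms t of tf(t, d) * count(t, q) * idf(t)
idf(t) = ln((N + 1) / (df(t) + 1)) + 1

where N is the number of documents and df(t) is the number of documents containing t. Documents are returned in order of descending score, with ties broken by the smaller index.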
Constraints
- 1 <= len(docs) <= 10^4
- The total number of tokens across all documents after tokenization is at most 2 * 10^6
- Document and query text may contain letters, digits, spaces, and punctuation
- Tokenization must lowercase the text, split on non-alphanumeric characters, and discard empty tokens (a minimal sketch follows this list)
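A minimal tokenizer satisfying this rule, mirroring the regex the reference solution uses:

import re

def tokenize(text):
    # Lowercase first, then match maximal runs of letters and digits;
    # this splits on non-alphanumerics and drops empty tokens in one step.
    return re.findall(r"[a-z0-9]+", text.lower())

For example, tokenize('Cat cat dog!') returns ['cat', 'cat', 'dog'].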
Examples
Input: (['red apple', 'green apple', 'banana'], 'apple')
Expected Output: [0, 1, 2]
Explanation: Documents 0 and 1 each contain 'apple' once, so they tie with a positive score. Document 2 gets score 0, and ties are broken by smaller index.
Input: (['Cat cat dog!', 'dog; mouse', 'cat'], 'cat dog')
Expected Output: [0, 1, 2]
Explanation: After tokenization and lowercasing, document 0 has tf(cat)=2 and tf(dog)=1, so it scores highest. Documents 1 and 2 each match one query term once, so they tie and index 1 comes before 2.
Input: (['v1 release notes', 'release v2', 'patch'], 'security')
Expected Output: [0, 1, 2]
Explanation: No document contains 'security', so every document gets score 0. The final order is by index.
Input: (['apple apple banana', 'apple banana banana', 'banana'], 'apple apple')
Expected Output: [0, 1, 2]
Explanation: The query contains 'apple' twice, so its contribution is doubled. Document 0 has two apples and ranks above document 1, which has one.
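Numerically, under the reference solution's smoothed IDF: with n = 3 and df(apple) = 2, idf(apple) = ln(4/3) + 1 ≈ 1.288, so document 0 scores 2 * 2 * 1.288 ≈ 5.15, document 1 scores 1 * 2 * 1.288 ≈ 2.58, and document 2 scores 0.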
Input: (['!!!'], 'anything')
Expected Output: [0]
Explanation: The only document tokenizes to an empty list, so its score is 0. There is only one valid index.
Input: (['rare common', 'common common', 'common'], 'rare common')
Expected Output: [0, 1, 2]
Explanation: The token 'rare' appears in fewer documents than 'common', so it gets a higher IDF. Document 0 contains both terms and ranks above document 1, even though document 1 contains 'common' twice.
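Numerically: idf(rare) = ln(4/2) + 1 ≈ 1.693 and idf(common) = ln(4/4) + 1 = 1, so document 0 scores 1 * 1.693 + 1 * 1 ≈ 2.69, document 1 scores 2 * 1 = 2, and document 2 scores 1.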
Input: (['alpha', 'beta'], '!!!')
Expected Output: [0, 1]
Explanation: The query tokenizes to no tokens, so every score is 0 and the indices are returned in ascending order.
Solution
def solution(docs, q):
    import math
    import re
    from collections import Counter

    def tokenize(text):
        # Lowercase, then match maximal runs of letters and digits;
        # empty tokens never appear.
        return re.findall(r"[a-z0-9]+", text.lower())

    n = len(docs)
    query_tokens = tokenize(q)
    if not query_tokens:
        # An empty query scores every document 0; return indices in order.
        return list(range(n))
    query_counts = Counter(query_tokens)
    query_terms = set(query_counts)
    # First pass: compute document frequency only for query terms.
    df = {term: 0 for term in query_terms}
    for doc in docs:
        seen = set()
        for token in tokenize(doc):
            if token in query_terms and token not in seen:
                df[token] += 1
                seen.add(token)
    # Compute smoothed IDF.
    idf = {
        term: math.log((n + 1) / (df[term] + 1)) + 1
        for term in query_terms
    }
    # Second pass: compute scores.
    scores = []
    for doc in docs:
        counts = Counter()
        for token in tokenize(doc):
            if token in query_terms:
                counts[token] += 1
        score = 0.0
        for term, tf in counts.items():
            score += tf * query_counts[term] * idf[term]
        scores.append(score)
    # Sort by descending score, breaking ties by ascending index.
    return sorted(range(n), key=lambda i: (-scores[i], i))

Time complexity: O(T + N log N), where T is the total number of tokens across all documents after tokenization and N is the number of documents. Space complexity: O(Uq + N), where Uq is the number of unique query tokens.
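A quick sanity check against the examples above, assuming the function is defined as solution:

assert solution(['red apple', 'green apple', 'banana'], 'apple') == [0, 1, 2]
assert solution(['Cat cat dog!', 'dog; mouse', 'cat'], 'cat dog') == [0, 1, 2]
assert solution(['apple apple banana', 'apple banana banana', 'banana'], 'apple apple') == [0, 1, 2]
assert solution(['rare common', 'common common', 'common'], 'rare common') == [0, 1, 2]
assert solution(['alpha', 'beta'], '!!!') == [0, 1]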
Hints
- You only need document frequencies for terms that actually appear in the query.
- To save memory, compute document frequencies in one pass, then compute scores in a second pass once IDF values are known.