Implement TF-IDF scoring for documents

This question tests your understanding of information retrieval and text representation (TF-IDF weighting, tokenization, term frequency, and document frequency) and your ability to implement an efficient scoring and ranking algorithm.

Question

Problem

Implement a simplified TF–IDF scorer.

You are given:

A list of documents docs, where each document is a string.
A query string q.

Tokenization rules (for this problem):

Lowercase everything.
Split on any character that is not a letter or digit (you may assume a helper tokenizer is available, or implement a simple one).
Ignore empty tokens.
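
The three rules above can be sketched as a small helper. This is only an illustration: the name `tokenize` is hypothetical, and it assumes "letters/digits" means ASCII a-z and 0-9 after lowercasing, which the problem statement does not spell out.

```python
import re

# Assumption: a token is a maximal run of ASCII letters/digits.
_TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and return its letter/digit runs as tokens."""
    # findall only returns non-empty matches, so empty tokens never appear.
    return _TOKEN_RE.findall(text.lower())
```

Matching runs of allowed characters, rather than splitting on the disallowed ones, avoids a separate pass to filter out empty strings.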

Definitions

Term frequency (the number of times $t$ occurs in $d$): $tf(t, d) = \#(t \text{ in } d)$
Document frequency (the number of documents that contain $t$ at least once): $df(t) = \#\{ d \in docs : t \in d \}$
Inverse document frequency (smoothed):

$idf(t) = \log\left(\frac{N + 1}{df(t) + 1}\right) + 1$

where $N = |docs|$.
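
As a rough sketch of the definitions above, document frequency and the smoothed IDF can be computed in one pass over already-tokenized documents. The names `idf_table` and `tokenized_docs` are hypothetical, and the natural logarithm is an assumption, since the problem does not state a base (the ranking is the same for any base greater than 1).

```python
import math
from collections import Counter


def idf_table(tokenized_docs: list[list[str]]) -> dict[str, float]:
    """Map each term seen in the corpus to log((N + 1) / (df + 1)) + 1."""
    n = len(tokenized_docs)
    df = Counter()
    for tokens in tokenized_docs:
        # set() ensures a term counts once per document, however often it repeats.
        df.update(set(tokens))
    # Assumption: natural log; the problem statement leaves the base open.
    return {term: math.log((n + 1) / (count + 1)) + 1.0 for term, count in df.items()}
```

A term that appears in no document has $df(t) = 0$ and is absent from this table, which is harmless for scoring because its $tf$ is 0 in every document.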

Document score for query:

$score(d, q) = \sum_{t \in tokens(q)} tf(t, d) \cdot idf(t)$

Task

Return the document indices sorted by descending score(d, q). Break ties by smaller index.
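
One possible end-to-end sketch of the task follows; it is not the only reasonable approach. The function name `rank_documents` is hypothetical, and the code assumes ASCII letter/digit tokens, a natural logarithm, and that a token repeated in the query contributes once per occurrence (reading the sum over $tokens(q)$ as a sum over a list).

```python
import math
import re
from collections import Counter


def rank_documents(docs: list[str], q: str) -> list[int]:
    """Return document indices by descending TF-IDF score, ties by smaller index."""
    token_re = re.compile(r"[a-z0-9]+")  # assumption: ASCII letters/digits
    n = len(docs)
    # Assumption: repeated query tokens are summed once per occurrence.
    query_counts = Counter(token_re.findall(q.lower()))

    tf_per_doc = []
    df = Counter()
    for doc in docs:
        # Only query terms can affect the score, so other tokens are skipped.
        counts = Counter(t for t in token_re.findall(doc.lower()) if t in query_counts)
        tf_per_doc.append(counts)
        df.update(counts.keys())  # each term counted once per document

    idf = {t: math.log((n + 1) / (df[t] + 1)) + 1.0 for t in query_counts}
    scores = [
        sum(mult * counts[t] * idf[t] for t, mult in query_counts.items())
        for counts in tf_per_doc
    ]
    # Negating the score sorts descending; the index breaks ties ascending.
    return sorted(range(n), key=lambda i: (-scores[i], i))
```

Each document is tokenized once and only query terms are counted, so the work is linear in the total token count plus a sort over the document indices.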

Constraints

1 <= len(docs) <= 10^4
Total tokens across all documents: up to about 2 * 10^6

Example

If docs = ["red apple", "green apple", "banana"] and q = "apple", documents 0 and 1 should rank above 2.
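
To make the example concrete, here is the arithmetic, assuming a natural logarithm (the problem does not fix the base). With $N = 3$ and $df(apple) = 2$:

$idf(apple) = \log\left(\frac{3 + 1}{2 + 1}\right) + 1 = \log\left(\frac{4}{3}\right) + 1 \approx 1.288$

$score(d_0, q) = score(d_1, q) = 1 \cdot 1.288 = 1.288$

$score(d_2, q) = 0 \cdot 1.288 = 0$

Documents 0 and 1 tie, so the smaller index goes first and the returned order is [0, 1, 2].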