This question evaluates understanding of information retrieval and text representation concepts (specifically TF–IDF weighting, tokenization, term frequency, and document frequency) and measures competency in implementing an efficient scoring and ranking algorithm.
Implement a simplified TF–IDF scorer.
You are given:
docs, where each document is a string.
q, a query string.
Tokenization rules (for this problem):
where .
Return the document indices sorted by descending score(d, q). Break ties by smaller index.
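The ordering rule above can be sketched on its own, assuming the per-document scores have already been computed into a list. The function name rank_by_score is hypothetical, not part of the problem statement.

```python
def rank_by_score(scores):
    """Return document indices by descending score, ties broken by smaller index."""
    # Sorting on the key (-score, index) puts higher scores first and,
    # among equal scores, the smaller index first.
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))
```

Using a single composite key keeps the tie-break explicit rather than relying on sort stability.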
1 <= len(docs) <= 10^4
2 * 10^6
If docs = ["red apple", "green apple", "banana"] and q = "apple", documents 0 and 1 should rank above 2.
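One possible end-to-end sketch is shown below. The exact tokenization rules and scoring formula are not reproduced in this statement, so everything specific here is an assumption: tokens are lowercased and split on whitespace, term frequency is the raw count, IDF is the smoothed form log((1 + N) / (1 + df)) + 1, and repeated query terms count once. The names tokenize and rank_documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; the problem's actual
    # tokenization rules may differ.
    return text.lower().split()


def rank_documents(docs, q):
    n = len(docs)
    # Term frequencies: one Counter of token counts per document.
    doc_counts = [Counter(tokenize(d)) for d in docs]

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for counts in doc_counts:
        df.update(counts.keys())

    # Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1, and each
    # distinct query term contributes once.
    query_terms = set(tokenize(q))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in query_terms}

    # Score each document as the sum of tf * idf over the query terms.
    scores = [sum(counts[t] * idf[t] for t in query_terms) for counts in doc_counts]

    # Descending score, ties broken by smaller index.
    return sorted(range(n), key=lambda i: (-scores[i], i))
```

Under these assumptions, the example above gives documents 0 and 1 the same positive score and document 2 a score of zero, so the result is [0, 1, 2]. Each document is tokenized once, and scoring only touches the query's distinct terms.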