Implement TF-IDF scoring for documents
Company: Apple
Role: Machine Learning Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Technical Screen
Quick Answer: This question evaluates understanding of information retrieval and text representation, specifically TF-IDF weighting, tokenization, term frequency, and document frequency, and measures the ability to implement an efficient scoring and ranking algorithm.
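Concretely, the scoring model the reference solution below implements is a smoothed TF-IDF sum:

score(d) = sum over query terms t of tf(t, d) * count(t, q) * idf(t)
idf(t) = ln((N + 1) / (df(t) + 1)) + 1

where N is the number of documents and df(t) is the number of documents containing t. Documents are returned in order of descending score, with ties broken by the smaller index.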
Constraints
- 1 <= len(docs) <= 10^4
- The total number of tokens across all documents after tokenization is at most 2 * 10^6
- Document and query text may contain letters, digits, spaces, and punctuation
- Tokenization must lowercase the text, split on non-alphanumeric characters, and discard empty tokens (a minimal sketch follows this list)
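A minimal tokenizer satisfying this rule, mirroring the regex the reference solution uses:

import re

def tokenize(text):
    # Lowercase first, then match maximal runs of letters and digits;
    # this splits on non-alphanumerics and drops empty tokens in one step.
    return re.findall(r"[a-z0-9]+", text.lower())

For example, tokenize('Cat cat dog!') returns ['cat', 'cat', 'dog'].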
Examples
Input: (['red apple', 'green apple', 'banana'], 'apple')
Expected Output: [0, 1, 2]
Explanation: Documents 0 and 1 each contain 'apple' once, so they tie with a positive score. Document 2 gets score 0, and ties are broken by smaller index.
Input: (['Cat cat dog!', 'dog; mouse', 'cat'], 'cat dog')
Expected Output: [0, 1, 2]
Explanation: After tokenization and lowercasing, document 0 has tf(cat)=2 and tf(dog)=1, so it scores highest. Documents 1 and 2 each match one query term once, so they tie and index 1 comes before 2.
Input: (['v1 release notes', 'release v2', 'patch'], 'security')
Expected Output: [0, 1, 2]
Explanation: No document contains 'security', so every document gets score 0. The final order is by index.
Input: (['apple apple banana', 'apple banana banana', 'banana'], 'apple apple')
Expected Output: [0, 1, 2]
Explanation: The query contains 'apple' twice, so its contribution is doubled. Document 0 has two apples and ranks above document 1, which has one.
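Numerically, under the reference solution's smoothed IDF: with n = 3 and df(apple) = 2, idf(apple) = ln(4/3) + 1 ≈ 1.288, so document 0 scores 2 * 2 * 1.288 ≈ 5.15, document 1 scores 1 * 2 * 1.288 ≈ 2.58, and document 2 scores 0.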
Input: (['!!!'], 'anything')
Expected Output: [0]
Explanation: The only document tokenizes to an empty list, so its score is 0. There is only one valid index.
Input: (['rare common', 'common common', 'common'], 'rare common')
Expected Output: [0, 1, 2]
Explanation: The token 'rare' appears in fewer documents than 'common', so it gets a higher IDF. Document 0 contains both terms and ranks above document 1, even though document 1 contains 'common' twice.
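Numerically: idf(rare) = ln(4/2) + 1 ≈ 1.693 and idf(common) = ln(4/4) + 1 = 1, so document 0 scores 1 * 1.693 + 1 * 1 ≈ 2.69, document 1 scores 2 * 1 = 2, and document 2 scores 1.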
Input: (['alpha', 'beta'], '!!!')
Expected Output: [0, 1]
Explanation: The query tokenizes to no tokens, so every score is 0 and the indices are returned in ascending order.
Solution
def solution(docs, q):
    import math
    import re
    from collections import Counter

    def tokenize(text):
        # Lowercase, then match maximal runs of letters and digits;
        # empty tokens never appear.
        return re.findall(r"[a-z0-9]+", text.lower())

    n = len(docs)
    query_tokens = tokenize(q)
    if not query_tokens:
        # An empty query scores every document 0; return indices in order.
        return list(range(n))
    query_counts = Counter(query_tokens)
    query_terms = set(query_counts)
    # First pass: compute document frequency only for query terms.
    df = {term: 0 for term in query_terms}
    for doc in docs:
        seen = set()
        for token in tokenize(doc):
            if token in query_terms and token not in seen:
                df[token] += 1
                seen.add(token)
    # Compute smoothed IDF.
    idf = {
        term: math.log((n + 1) / (df[term] + 1)) + 1
        for term in query_terms
    }
    # Second pass: compute scores.
    scores = []
    for doc in docs:
        counts = Counter()
        for token in tokenize(doc):
            if token in query_terms:
                counts[token] += 1
        score = 0.0
        for term, tf in counts.items():
            score += tf * query_counts[term] * idf[term]
        scores.append(score)
    # Sort by descending score, breaking ties by ascending index.
    return sorted(range(n), key=lambda i: (-scores[i], i))

Time complexity: O(T + N log N), where T is the total number of tokens across all documents after tokenization and N is the number of documents. Space complexity: O(Uq + N), where Uq is the number of unique query tokens.
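A quick sanity check against the examples above, assuming the function is defined as solution:

assert solution(['red apple', 'green apple', 'banana'], 'apple') == [0, 1, 2]
assert solution(['Cat cat dog!', 'dog; mouse', 'cat'], 'cat dog') == [0, 1, 2]
assert solution(['apple apple banana', 'apple banana banana', 'banana'], 'apple apple') == [0, 1, 2]
assert solution(['rare common', 'common common', 'common'], 'rare common') == [0, 1, 2]
assert solution(['alpha', 'beta'], '!!!') == [0, 1]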
Hints
- You only need document frequencies for terms that actually appear in the query.
- To save memory, compute document frequencies in one pass, then compute scores in a second pass once IDF values are known.