Implement TF–IDF from Scratch (Python + NumPy/SciPy)
You are given a list of documents (strings). Build a TF–IDF vectorizer from scratch with the following behavior.
Requirements
- Tokenization and preprocessing (a tokenizer sketch follows this group)
  - Convert to lowercase.
  - Strip punctuation.
  - Use a memory-efficient tokenizer (e.g., streaming/generator-based; no external NLP libs).
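A minimal sketch of such a tokenizer, assuming `str.translate` for punctuation stripping; the helper name `tokenize` is illustrative, not part of the spec:

```python
import string
from typing import Iterator

# Translation table mapping every ASCII punctuation character to None (deletion).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def tokenize(document: str) -> Iterator[str]:
    """Yield lowercased, punctuation-stripped tokens lazily; no token list is materialized."""
    for word in document.lower().split():
        token = word.translate(_PUNCT_TABLE)
        if token:  # drop words that were pure punctuation
            yield token
```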
- Vocabulary and filtering (a helper sketch follows this group)
  - Support optional min_df and max_df document-frequency filtering.
  - Interpret min_df and max_df as either:
    - an integer: an absolute number of documents, or
    - a float in (0, 1]: a fraction of documents.
  - The vocabulary (columns) must be ordered lexicographically.
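A sketch of the threshold resolution and vocabulary build; the helper names and the dict-of-document-frequencies representation are assumptions, and how to round a fractional threshold at the boundary is a design choice (ceiling is used here for both):

```python
import math
from typing import Dict, Union

def resolve_df(value: Union[int, float], n_docs: int) -> int:
    """Map an int (absolute count) or a float in (0, 1] (fraction) to an absolute document count."""
    if isinstance(value, float):
        return int(math.ceil(value * n_docs))  # boundary rounding is a design choice
    return value

def build_vocabulary(doc_freq: Dict[str, int], n_docs: int, min_df=1, max_df=1.0) -> Dict[str, int]:
    """Keep terms with min_df <= df <= max_df; assign columns in lexicographic term order."""
    lo, hi = resolve_df(min_df, n_docs), resolve_df(max_df, n_docs)
    kept = sorted(term for term, df in doc_freq.items() if lo <= df <= hi)
    return {term: col for col, term in enumerate(kept)}
```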
- TF–IDF computation (a construction sketch follows this group)
  - Term frequency (TF) is the raw count per document.
  - Smoothed IDF: idf = log((1 + N) / (1 + df)) + 1, where N is the number of documents and df is the term's document frequency.
  - Output a SciPy CSR sparse matrix with shape (n_documents, vocab_size).
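A sketch of the weighting step, assuming pre-tokenized documents and a fitted term-to-column vocabulary; for brevity it recomputes document frequency here rather than reusing the counts gathered during filtering:

```python
from collections import Counter
from typing import Dict, List

import numpy as np
import scipy.sparse as sp

def tfidf_matrix(token_docs: List[List[str]], vocab: Dict[str, int]) -> sp.csr_matrix:
    """Build a CSR TF-IDF matrix of shape (len(token_docs), len(vocab))."""
    n_docs, n_terms = len(token_docs), len(vocab)
    df = np.zeros(n_terms, dtype=np.int64)               # document frequency per column
    indptr, indices, data = [0], [], []
    for tokens in token_docs:
        counts = Counter(t for t in tokens if t in vocab)  # OOV tokens are dropped here
        for term, count in sorted(counts.items()):       # lexicographic = column order
            col = vocab[term]
            indices.append(col)
            data.append(count)
            df[col] += 1                                 # one increment per (doc, term) pair
        indptr.append(len(indices))                      # an empty document adds an empty row
    # Smoothed IDF: idf = log((1 + N) / (1 + df)) + 1; never divides by zero.
    idf = np.log((1.0 + n_docs) / (1.0 + df)) + 1.0
    tf = sp.csr_matrix((np.asarray(data, dtype=np.float64), indices, indptr),
                       shape=(n_docs, n_terms))
    return (tf @ sp.diags(idf)).tocsr()                  # scale each column by its IDF
```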
- Transform behavior (a normalization sketch follows this group)
  - Ignore out-of-vocabulary (OOV) tokens at transform time.
  - Support optional L2 normalization per row (document).
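OOV skipping already appears in the counting sketch above (`if t in vocab`). For the normalization, scaling the CSR `.data` array row by row avoids densifying; the zero-norm guard is what keeps empty documents from producing NaNs:

```python
import numpy as np
import scipy.sparse as sp

def l2_normalize_rows(X: sp.csr_matrix) -> sp.csr_matrix:
    """Return a copy of X with each row scaled to unit L2 norm; all-zero rows are left as-is."""
    X = X.copy()
    norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
    norms[norms == 0.0] = 1.0                    # guard: empty documents stay all-zero, no 0/0
    row_of = np.repeat(np.arange(X.shape[0]), np.diff(X.indptr))  # row index of each stored value
    X.data /= norms[row_of]
    return X
```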
- API surface (one possible shape is sketched after this group)
  - Implement a small class with methods fit, transform, fit_transform, and inverse_transform(doc_index, k=None), where inverse_transform returns the top-k terms for a given document ranked by TF–IDF score (if k is None, return all terms sorted by descending score).
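One possible shape for that class; the names, defaults, and stored attributes are assumptions, and the bodies are stubs since the pieces are sketched above. Whether inverse_transform reads from the fit corpus or from a cached last transform is left open by the spec and must be pinned down in the implementation:

```python
import scipy.sparse as sp

class TfidfVectorizer:
    """From-scratch TF-IDF vectorizer (stdlib + NumPy/SciPy only)."""

    def __init__(self, min_df=1, max_df=1.0, norm=None):
        self.min_df = min_df        # int (absolute) or float in (0, 1] (fraction)
        self.max_df = max_df
        self.norm = norm            # None or "l2"
        self.vocabulary_ = None     # term -> column index, lexicographic
        self.idf_ = None            # learned IDF vector, set by fit

    def fit(self, documents: list) -> "TfidfVectorizer":
        """Learn the df-filtered vocabulary and IDF vector from the corpus; return self."""
        ...

    def transform(self, documents: list) -> sp.csr_matrix:
        """Map documents to a CSR TF-IDF matrix; OOV tokens are ignored."""
        ...

    def fit_transform(self, documents: list) -> sp.csr_matrix:
        return self.fit(documents).transform(documents)

    def inverse_transform(self, doc_index: int, k: int = None) -> list:
        """Return the top-k terms of one document by TF-IDF score (all terms if k is None)."""
        ...
```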
- Complexity (a reference figure follows this group)
  - Analyze time and space complexity for fit and transform in terms of the number of documents, total tokens, and vocabulary size.
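For reference, the sketches above suggest a natural target: fit makes one pass over all T tokens plus a lexicographic sort of the V vocabulary terms, so roughly O(T + V log V) time and O(V + nnz) space, where nnz is the number of distinct (document, term) pairs; transform is O(T + nnz) time. This assumes average-case O(1) hash-map operations.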
- Unit test (a sketch follows this group)
  - Provide a unit test that checks correctness on a 3-document toy corpus with repeated terms.
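A sketch of such a test, written with `unittest` against the class shape assumed above (it presumes `TfidfVectorizer` is in scope); the expected values are computed by hand from idf = log((1 + N) / (1 + df)) + 1 with N = 3:

```python
import math
import unittest

class TestTfidf(unittest.TestCase):
    def test_toy_corpus(self):
        corpus = ["cat sat sat", "cat dog", "dog dog ran"]
        vec = TfidfVectorizer()                 # the class sketched above, assumed in scope
        X = vec.fit_transform(corpus)

        # Shape and deterministic lexicographic column order.
        self.assertEqual(X.shape, (3, 4))
        self.assertEqual(sorted(vec.vocabulary_, key=vec.vocabulary_.get),
                         ["cat", "dog", "ran", "sat"])

        # Hand-computed: N = 3, df(cat) = 2, df(sat) = 1.
        idf_cat = math.log(4 / 3) + 1           # log((1 + 3) / (1 + 2)) + 1
        idf_sat = math.log(4 / 2) + 1           # log((1 + 3) / (1 + 1)) + 1
        self.assertAlmostEqual(X[0, 0], 1 * idf_cat)   # "cat" once in doc 0
        self.assertAlmostEqual(X[0, 3], 2 * idf_sat)   # "sat" twice in doc 0

        # An empty document transforms to an all-zero row.
        self.assertEqual(vec.transform([""]).nnz, 0)

if __name__ == "__main__":
    unittest.main()
```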
Constraints
- Use only the Python standard library, NumPy, and SciPy (no external NLP libraries).
- Handle empty documents gracefully.
- For reproducibility, ensure deterministic lexicographic column order.
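Putting the constraints together, calling code might look like this (again assuming the class sketched under API surface):

```python
corpus = ["The cat sat on the mat.", "", "A dog!"]  # note the empty middle document

vec = TfidfVectorizer(min_df=1, max_df=1.0, norm="l2")
X = vec.fit_transform(corpus)           # CSR, shape (3, vocab_size); row 1 stays all-zero
print(vec.inverse_transform(0, k=2))    # the two highest-scoring terms of document 0
```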