Implement TF–IDF from Scratch (Python + NumPy/SciPy)
You are given a list of documents (strings). Build a TF–IDF vectorizer from scratch with the following behavior.
Requirements
- Tokenization and preprocessing (a tokenizer sketch follows this group)
  - Convert to lowercase.
  - Strip punctuation.
  - Use a memory-efficient tokenizer (e.g., streaming/generator-based; no external NLP libs).
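A minimal sketch of such a tokenizer, assuming `str.translate` for punctuation stripping; the helper name `tokenize` is illustrative, not part of the spec:

```python
import string
from typing import Iterator

# Translation table mapping every ASCII punctuation character to None (deletion).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def tokenize(document: str) -> Iterator[str]:
    """Yield lowercased, punctuation-stripped tokens lazily; no token list is materialized."""
    for word in document.lower().split():
        token = word.translate(_PUNCT_TABLE)
        if token:  # drop words that were pure punctuation
            yield token
```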
- Vocabulary and filtering (a helper sketch follows this group)
  - Support optional min_df and max_df document-frequency filtering.
  - Interpret min_df and max_df as either:
    - an integer: an absolute number of documents, or
    - a float in (0, 1]: a fraction of documents.
  - The vocabulary (columns) must be ordered lexicographically.
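A sketch of the threshold resolution and vocabulary build; the helper names and the dict-of-document-frequencies representation are assumptions, and how to round a fractional threshold at the boundary is a design choice (ceiling is used here for both):

```python
import math
from typing import Dict, Union

def resolve_df(value: Union[int, float], n_docs: int) -> int:
    """Map an int (absolute count) or a float in (0, 1] (fraction) to an absolute document count."""
    if isinstance(value, float):
        return int(math.ceil(value * n_docs))  # boundary rounding is a design choice
    return value

def build_vocabulary(doc_freq: Dict[str, int], n_docs: int, min_df=1, max_df=1.0) -> Dict[str, int]:
    """Keep terms with min_df <= df <= max_df; assign columns in lexicographic term order."""
    lo, hi = resolve_df(min_df, n_docs), resolve_df(max_df, n_docs)
    kept = sorted(term for term, df in doc_freq.items() if lo <= df <= hi)
    return {term: col for col, term in enumerate(kept)}
```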
- TF–IDF computation (a construction sketch follows this group)
  - Term frequency (TF) is the raw count per document.
  - Smoothed IDF: idf = log((1 + N) / (1 + df)) + 1, where N is the number of documents and df is the term's document frequency.
  - Output a SciPy CSR sparse matrix with shape (n_documents, vocab_size).
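A sketch of the weighting step, assuming pre-tokenized documents and a fitted term-to-column vocabulary; for brevity it recomputes document frequency here rather than reusing the counts gathered during filtering:

```python
from collections import Counter
from typing import Dict, List

import numpy as np
import scipy.sparse as sp

def tfidf_matrix(token_docs: List[List[str]], vocab: Dict[str, int]) -> sp.csr_matrix:
    """Build a CSR TF-IDF matrix of shape (len(token_docs), len(vocab))."""
    n_docs, n_terms = len(token_docs), len(vocab)
    df = np.zeros(n_terms, dtype=np.int64)               # document frequency per column
    indptr, indices, data = [0], [], []
    for tokens in token_docs:
        counts = Counter(t for t in tokens if t in vocab)  # OOV tokens are dropped here
        for term, count in sorted(counts.items()):       # lexicographic = column order
            col = vocab[term]
            indices.append(col)
            data.append(count)
            df[col] += 1                                 # one increment per (doc, term) pair
        indptr.append(len(indices))                      # an empty document adds an empty row
    # Smoothed IDF: idf = log((1 + N) / (1 + df)) + 1; never divides by zero.
    idf = np.log((1.0 + n_docs) / (1.0 + df)) + 1.0
    tf = sp.csr_matrix((np.asarray(data, dtype=np.float64), indices, indptr),
                       shape=(n_docs, n_terms))
    return (tf @ sp.diags(idf)).tocsr()                  # scale each column by its IDF
```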
- Transform behavior (a normalization sketch follows this group)
  - Ignore out-of-vocabulary (OOV) tokens at transform time.
  - Support optional L2 normalization per row (document).
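OOV skipping already appears in the counting sketch above (`if t in vocab`). For the normalization, scaling the CSR `.data` array row by row avoids densifying; the zero-norm guard is what keeps empty documents from producing NaNs:

```python
import numpy as np
import scipy.sparse as sp

def l2_normalize_rows(X: sp.csr_matrix) -> sp.csr_matrix:
    """Return a copy of X with each row scaled to unit L2 norm; all-zero rows are left as-is."""
    X = X.copy()
    norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
    norms[norms == 0.0] = 1.0                    # guard: empty documents stay all-zero, no 0/0
    row_of = np.repeat(np.arange(X.shape[0]), np.diff(X.indptr))  # row index of each stored value
    X.data /= norms[row_of]
    return X
```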
- API surface (one possible shape is sketched after this group)
  - Implement a small class with methods fit, transform, fit_transform, and inverse_transform(doc_index, k=None), where inverse_transform returns the top-k terms for a given document ranked by TF–IDF score (if k is None, return all terms sorted by descending score).
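One possible shape for that class; the names, defaults, and stored attributes are assumptions, and the bodies are stubs since the pieces are sketched above. Whether inverse_transform reads from the fit corpus or from a cached last transform is left open by the spec and must be pinned down in the implementation:

```python
import scipy.sparse as sp

class TfidfVectorizer:
    """From-scratch TF-IDF vectorizer (stdlib + NumPy/SciPy only)."""

    def __init__(self, min_df=1, max_df=1.0, norm=None):
        self.min_df = min_df        # int (absolute) or float in (0, 1] (fraction)
        self.max_df = max_df
        self.norm = norm            # None or "l2"
        self.vocabulary_ = None     # term -> column index, lexicographic
        self.idf_ = None            # learned IDF vector, set by fit

    def fit(self, documents: list) -> "TfidfVectorizer":
        """Learn the df-filtered vocabulary and IDF vector from the corpus; return self."""
        ...

    def transform(self, documents: list) -> sp.csr_matrix:
        """Map documents to a CSR TF-IDF matrix; OOV tokens are ignored."""
        ...

    def fit_transform(self, documents: list) -> sp.csr_matrix:
        return self.fit(documents).transform(documents)

    def inverse_transform(self, doc_index: int, k: int = None) -> list:
        """Return the top-k terms of one document by TF-IDF score (all terms if k is None)."""
        ...
```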
- Complexity (a reference figure follows this group)
  - Analyze time and space complexity for fit and transform in terms of the number of documents, total tokens, and vocabulary size.
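For reference, the sketches above suggest a natural target: fit makes one pass over all T tokens plus a lexicographic sort of the V vocabulary terms, so roughly O(T + V log V) time and O(V + nnz) space, where nnz is the number of distinct (document, term) pairs; transform is O(T + nnz) time. This assumes average-case O(1) hash-map operations.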
- Unit test (a sketch follows this group)
  - Provide a unit test that checks correctness on a 3-document toy corpus with repeated terms.
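A sketch of such a test, written with `unittest` against the class shape assumed above (it presumes `TfidfVectorizer` is in scope); the expected values are computed by hand from idf = log((1 + N) / (1 + df)) + 1 with N = 3:

```python
import math
import unittest

class TestTfidf(unittest.TestCase):
    def test_toy_corpus(self):
        corpus = ["cat sat sat", "cat dog", "dog dog ran"]
        vec = TfidfVectorizer()                 # the class sketched above, assumed in scope
        X = vec.fit_transform(corpus)

        # Shape and deterministic lexicographic column order.
        self.assertEqual(X.shape, (3, 4))
        self.assertEqual(sorted(vec.vocabulary_, key=vec.vocabulary_.get),
                         ["cat", "dog", "ran", "sat"])

        # Hand-computed: N = 3, df(cat) = 2, df(sat) = 1.
        idf_cat = math.log(4 / 3) + 1           # log((1 + 3) / (1 + 2)) + 1
        idf_sat = math.log(4 / 2) + 1           # log((1 + 3) / (1 + 1)) + 1
        self.assertAlmostEqual(X[0, 0], 1 * idf_cat)   # "cat" once in doc 0
        self.assertAlmostEqual(X[0, 3], 2 * idf_sat)   # "sat" twice in doc 0

        # An empty document transforms to an all-zero row.
        self.assertEqual(vec.transform([""]).nnz, 0)

if __name__ == "__main__":
    unittest.main()
```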
Constraints
- Use only the Python standard library, NumPy, and SciPy (no external NLP libraries).
- Handle empty documents gracefully.
- For reproducibility, ensure deterministic lexicographic column order.
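Putting the constraints together, calling code might look like this (again assuming the class sketched under API surface):

```python
corpus = ["The cat sat on the mat.", "", "A dog!"]  # note the empty middle document

vec = TfidfVectorizer(min_df=1, max_df=1.0, norm="l2")
X = vec.fit_transform(corpus)           # CSR, shape (3, vocab_size); row 1 stays all-zero
print(vec.inverse_transform(0, k=2))    # the two highest-scoring terms of document 0
```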