Implement TF–IDF from scratch. Given a list of documents (strings), build a memory-efficient tokenizer (lowercase, strip punctuation, optional min_df/max_df filtering), compute term frequencies, smoothed IDF = log((1 + N) / (1 + df)) + 1, and output a CSR sparse matrix (rows=documents, cols=vocabulary ordered lexicographically). Requirements: (1) do not use external NLP libraries; only Python, NumPy, and SciPy are allowed; (2) handle OOV tokens at transform time by ignoring them; (3) support L2 normalization per row; (4) provide an inverse_transform(doc_index) that reconstructs the top-k terms by TF–IDF score; (5) analyze time and space complexity for fit and transform; and (6) write a unit test showing correctness on a 3-document toy corpus with repeated terms.

# Approach Overview

The vectorizer mirrors the structure of `sklearn`'s `TfidfVectorizer` but is built from primitives:

- **Tokenizer**: a regex-based generator that streams lowercase alphanumeric tokens and discards punctuation. Because it is a generator, no per-document token list is ever materialized; only the lowercased copy of the current document and one token at a time are held in memory.
- **`fit`**: one pass over the corpus to compute document frequency (`df`) from the *unique* terms in each document, then build a lexicographically sorted vocabulary and the smoothed IDF vector.
- **`transform`**: one pass to count in-vocabulary tokens per document, assemble a CSR matrix from `(data, indices, indptr)`, apply TF·IDF weighting, and optionally L2-normalize each row.
- **`inverse_transform`**: read a single CSR row and return its highest-scoring terms.

The two-stage `fit`/`transform` split is what lets the vocabulary be fixed before any vector is produced, and what makes OOV handling well-defined (a token is OOV iff it is absent from the fitted vocabulary).

**Key formula (smoothed IDF):**

$$\text{idf}_j = \log\!\left(\frac{1 + N}{1 + \text{df}_j}\right) + 1$$

with $N$ = number of documents and $\text{df}_j$ = number of documents containing term $j$. The $+1$ in both numerator and denominator (smoothing) acts as if one extra document containing every term had been seen, which prevents division by zero. The trailing $+1$ guarantees every term keeps a non-zero weight even when it appears in every document.

---

## Reference Implementation (Python + NumPy/SciPy only)

```python
import re
import math
import numpy as np
from scipy.sparse import csr_matrix


class TfidfVectorizerScratch:
    """
    A TF–IDF vectorizer built from scratch using only Python, NumPy, and SciPy.

    Parameters
    ----------
    min_df : int or float, default=1
        - int : keep terms with document frequency >= min_df.
        - float in (0, 1] : keep terms with df >= ceil(min_df * n_docs).
    max_df : int or float, default=1.0
        - int : keep terms with document frequency <= max_df.
        - float in (0, 1] : keep terms with df <= floor(max_df * n_docs).
    normalize : bool, default=True
        If True, L2-normalize each row of the TF–IDF matrix.
    """

    _token_re = re.compile(r"[a-z0-9]+")  # lowercase alphanumeric runs

    def __init__(self, min_df=1, max_df=1.0, normalize=True):
        self.min_df = min_df
        self.max_df = max_df
        self.normalize = normalize
        # Learned after fit
        self.n_docs_ = None
        self.vocabulary_ = None  # term -> column index
        self.terms_ = None       # column index -> term (lexicographic)
        self.idf_ = None         # np.ndarray aligned with terms_
        self.df_ = None          # term -> df
        self._last_X = None      # cache for inverse_transform convenience

    # ------------------ Tokenization ------------------
    @classmethod
    def _iter_tokens(cls, text):
        """Streaming tokenizer: lowercase, strip punctuation, yield one token at a time."""
        for m in cls._token_re.finditer(text.lower()):
            yield m.group(0)

    # ------------------ Threshold resolution ------------------
    @staticmethod
    def _resolve_df_thresholds(min_df, max_df, n_docs):
        """Convert int/float thresholds into absolute integer df bounds."""
        if isinstance(min_df, float):
            if not (0.0 < min_df <= 1.0):
                raise ValueError("min_df float must be in (0, 1].")
            min_df_abs = math.ceil(min_df * n_docs)
        else:
            min_df_abs = int(min_df)

        if isinstance(max_df, float):
            if not (0.0 < max_df <= 1.0):
                raise ValueError("max_df float must be in (0, 1].")
            max_df_abs = math.floor(max_df * n_docs)
        else:
            max_df_abs = int(max_df)

        min_df_abs = max(min_df_abs, 1)
        if max_df_abs < min_df_abs:
            raise ValueError(
                f"max_df ({max_df_abs}) < min_df ({min_df_abs}); "
                "no term can satisfy both bounds."
            )
        return min_df_abs, max_df_abs

    # ------------------ Vocabulary / IDF (shared) ------------------
    def _build_vocab_and_idf(self, df, n_docs):
        """Given a df dict and n_docs, set terms_/vocabulary_/df_/idf_."""
        self.n_docs_ = n_docs
        if n_docs == 0:
            self.terms_ = []
            self.vocabulary_ = {}
            self.df_ = {}
            self.idf_ = np.zeros((0,), dtype=np.float64)
            return

        min_df_abs, max_df_abs = self._resolve_df_thresholds(
            self.min_df, self.max_df, n_docs
        )
        terms = [t for t, c in df.items() if min_df_abs <= c <= max_df_abs]
        terms.sort()  # lexicographic, deterministic column order

        self.terms_ = terms
        self.vocabulary_ = {t: i for i, t in enumerate(terms)}
        self.df_ = {t: df[t] for t in terms}

        # Smoothed IDF: log((1 + N) / (1 + df)) + 1, aligned with terms_
        idf = np.empty(len(terms), dtype=np.float64)
        for i, t in enumerate(terms):
            idf[i] = math.log((1 + n_docs) / (1 + df[t])) + 1.0
        self.idf_ = idf
```
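The `transform` step described in the overview (count in-vocabulary tokens, assemble CSR from `(data, indices, indptr)`, weight by IDF, optionally L2-normalize) can be sketched as a standalone function. This is a minimal sketch under stated assumptions rather than the class's own method: `transform_sketch`, its parameters, and the example inputs are hypothetical names chosen for illustration.

```python
import re
from collections import Counter

import numpy as np
from scipy.sparse import csr_matrix

_TOKEN_RE = re.compile(r"[a-z0-9]+")


def transform_sketch(docs, vocabulary, idf, normalize=True):
    """Hypothetical helper: encode docs as a TF-IDF CSR matrix.

    vocabulary maps term -> column index; idf is aligned with those columns.
    """
    data, indices, indptr = [], [], [0]
    for text in docs:
        # Count only in-vocabulary tokens; OOV tokens are skipped.
        counts = Counter(
            vocabulary[m.group(0)]
            for m in _TOKEN_RE.finditer(text.lower())
            if m.group(0) in vocabulary
        )
        cols = sorted(counts)  # ascending column order within the row
        weights = np.array([counts[c] * idf[c] for c in cols], dtype=np.float64)
        if normalize and weights.size:
            norm = np.linalg.norm(weights)
            if norm > 0:
                weights = weights / norm
        indices.extend(cols)
        data.extend(weights.tolist())
        indptr.append(len(indices))  # empty or all-OOV docs add an empty row
    return csr_matrix(
        (
            np.array(data, dtype=np.float64),
            np.array(indices, dtype=np.int32),
            np.array(indptr, dtype=np.int32),
        ),
        shape=(len(indptr) - 1, len(idf)),
    )


# Assumed toy inputs: a two-term vocabulary with made-up IDF values.
X = transform_sketch(["a a b", "zzz", ""], {"a": 0, "b": 1}, np.array([1.0, 2.0]))
```

Sorting the column indices inside each row keeps the CSR structure in canonical order, so no later call to `sort_indices()` is needed.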

How do I approach Coding & Algorithms interview questions?

Coding & Algorithms questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master coding & algorithms interviews.

What difficulty level is this interview question?

This is a medium difficulty Coding & Algorithms question, commonly asked during Onsite rounds at Thumbtack.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Thumbtack during technical interviews.

Implement TF–IDF with sparse matrices | Thumbtack Interview Question

Q: Implement TF–IDF with sparse matrices

This question tests a candidate's practical understanding of information retrieval and natural language processing by requiring a from-scratch implementation of TF–IDF vectorization. It evaluates proficiency in numerical computing, sparse matrix construction, and algorithm design — skills commonly assessed for data science and machine learning engineering roles.

Implement TF–IDF from Scratch (Python + NumPy/SciPy)

You are given a list of documents (plain strings). Implement a TF–IDF vectorizer from scratch — no scikit-learn, no external NLP libraries — that turns the corpus into a sparse document–term matrix you could feed to a downstream model.

Your vectorizer must support the following behavior:

- **Tokenization and preprocessing.** Lowercase the text, strip punctuation, and tokenize using a memory-efficient (streaming / generator-based) tokenizer. No external NLP libs.
- **Vocabulary and filtering.** Support optional `min_df` and `max_df` document-frequency filtering. Each threshold may be an integer (an absolute number of documents) or a float in $(0, 1]$ (a fraction of documents). The vocabulary (the matrix columns) must be ordered lexicographically for deterministic output.
- **TF–IDF computation.** Term frequency (TF) is the raw count of a term in a document. Use the smoothed IDF $\text{idf}_t = \log\!\left(\frac{1 + N}{1 + \text{df}_t}\right) + 1$, where $N$ is the number of documents and $\text{df}_t$ is the number of documents containing term $t$. Output a SciPy CSR sparse matrix of shape `(n_documents, vocab_size)`.
- **Transform behavior.** At transform time, ignore out-of-vocabulary (OOV) tokens. Support optional L2 row normalization (each document vector scaled to unit length).
- **API surface.** Expose a class with `fit`, `transform`, `fit_transform`, and `inverse_transform(doc_index, k=None)`, where `inverse_transform` returns the top-$k$ terms for a document ranked by TF–IDF score (if `k` is `None`, return all non-zero terms sorted by score descending).
- **Complexity.** Analyze the time and space complexity of `fit` and `transform` in terms of the number of documents, total tokens, and vocabulary size.
- **Unit test.** Provide a unit test that checks correctness on a 3-document toy corpus containing repeated terms.
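To make the IDF definition concrete, here is a minimal sketch that computes document frequencies from the unique terms of each document and then applies the smoothed formula. The function name `smoothed_idf` and the three-document corpus are illustrative assumptions, not part of the question, and `min_df`/`max_df` filtering is omitted for brevity.

```python
import math
import re

import numpy as np

_TOKEN_RE = re.compile(r"[a-z0-9]+")


def smoothed_idf(docs):
    """Return (sorted terms, idf array) for an iterable of strings (hypothetical helper)."""
    df = {}
    n_docs = 0
    for text in docs:
        n_docs += 1
        # A set makes each term count once per document, however often it repeats.
        for term in {m.group(0) for m in _TOKEN_RE.finditer(text.lower())}:
            df[term] = df.get(term, 0) + 1
    terms = sorted(df)  # lexicographic column order
    idf = np.array(
        [math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in terms],
        dtype=np.float64,
    )
    return terms, idf


terms, idf = smoothed_idf(["The cat sat.", "The cat sat on the mat!", "the dog"])
# terms is ['cat', 'dog', 'mat', 'on', 'sat', 'the']
# "the" occurs in all 3 documents, so its IDF is log(4/4) + 1 = 1.0
# "dog" occurs in 1 document, so its IDF is log(4/2) + 1
```

Note that "the" appears twice in the second document but still contributes only 1 to its document frequency.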

Constraints & Assumptions

- Allowed dependencies: Python standard library, NumPy, and SciPy only. No scikit-learn, NLTK, spaCy, or other NLP libraries.
- Input is an iterable of strings. Separate `fit` and `transform` calls each iterate over the corpus, so a re-iterable container (e.g. a `list`) is assumed; a one-shot generator would be exhausted by `fit` and leave nothing for `transform`. State this assumption explicitly.
- Handle empty documents gracefully (all-zero rows).
- Output must be deterministic: identical input yields identical column order and values across runs.
- Tokens can be assumed to be runs of lowercase alphanumeric characters (`[a-z0-9]+`) unless the interviewer specifies otherwise. State your choice.
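The re-iterability assumption is easy to demonstrate in plain Python. In this small sketch, `count_docs` is a hypothetical stand-in for any pass over the corpus (such as a `fit` or a `transform`), and the corpus is made up for illustration.

```python
def count_docs(docs):
    """Stand-in for any single pass over the corpus (hypothetical helper)."""
    return sum(1 for _ in docs)


corpus = ["a b", "", "b c"]

one_shot = (doc for doc in corpus)   # generator: can be consumed only once
first_pass = count_docs(one_shot)    # sees 3 documents
second_pass = count_docs(one_shot)   # sees 0 documents: already exhausted

materialized = list(corpus)          # re-iterable container
list_first = count_docs(materialized)
list_second = count_docs(materialized)  # still 3 on the second pass
```

If the input may be a generator, one option is to materialize it once before fitting, at the cost of holding the whole corpus in memory.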

Clarifying Questions to Ask

- How should tokens be defined: alphanumeric runs only, or keep apostrophes/hyphens (e.g. "don't", "state-of-the-art")? Should digits be kept or dropped?
- What TF variant is expected: raw count (as stated), log-scaled $1 + \log(\text{tf})$, or sublinear? (The prompt specifies raw count; confirm.)
- Should `transform` accept a fresh corpus (unseen documents), or only re-encode the fitted corpus? This drives whether OOV handling is exercised.
- Is `idf` defined exactly as given (the scikit-learn smoothed form), or should I support the unsmoothed variant $\log(N/\text{df})$ as well?
- What is the expected scale (number of documents, vocabulary size)? This decides whether per-document count caching in `fit_transform` is acceptable or whether memory must stay bounded.
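The TF and IDF variants named in these questions differ only in a line or two of arithmetic, which the following sketch makes explicit. The function names and the `sublinear` flag are hypothetical, chosen for illustration; only the formulas themselves come from the question.

```python
import math


def idf_smoothed(n_docs, df):
    """Smoothed IDF as specified in the prompt: log((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1.0


def idf_unsmoothed(n_docs, df):
    """Unsmoothed variant log(N / df); only defined for df >= 1."""
    return math.log(n_docs / df)


def tf_weight(count, sublinear=False):
    """Raw-count TF by default; 1 + log(count) when sublinear is True."""
    if count <= 0:
        return 0.0
    return 1.0 + math.log(count) if sublinear else float(count)


# A term present in all 3 of 3 documents:
everywhere_smoothed = idf_smoothed(3, 3)      # 1.0
everywhere_unsmoothed = idf_unsmoothed(3, 3)  # 0.0

# A term that occurs 4 times in one document:
raw_tf = tf_weight(4)                   # 4.0
log_tf = tf_weight(4, sublinear=True)   # 1 + log(4)
```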

What a Strong Answer Covers

- **Correct, working code** that runs and passes the toy test: a class with `fit` / `transform` / `fit_transform` / `inverse_transform`, a streaming tokenizer, and a SciPy CSR output of the right shape.
- **Faithful IDF and TF semantics**: smoothed IDF exactly as specified, raw-count TF, and `df` computed from unique terms per document (not raw counts).
- **Correct vocabulary handling**: lexicographic column order, `min_df` / `max_df` resolved correctly for both int and float, OOV tokens dropped at transform time, L2 normalization applied per row and safe on zero rows.
- **Sparse-structure fluency**: building CSR via `(data, indices, indptr)` rather than densifying; awareness of canonical column ordering and why it matters.
- **Complexity analysis** that separates tokenization cost ($O(T)$), the $O(V \log V)$ vocabulary sort, and the $O(\text{nnz})$ matrix assembly, in terms of documents $D$, tokens $T$, vocabulary $V$, and non-zeros.
- **Robustness**: empty documents, empty corpus, all-OOV documents, and the `max_df < min_df` misconfiguration all handled deliberately.
- **A meaningful unit test**: asserts vocabulary order, specific IDF values, matrix shape/nnz, OOV behavior, normalized row norms $\approx 1$, and `inverse_transform` ranking.
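As an illustration of the sparse-structure point, the ranking behind `inverse_transform` can be read straight from the CSR arrays without densifying the row. This is a minimal sketch, assuming a hypothetical free function `top_k_terms` and a tie-break on the term itself for determinism (the question does not specify tie-breaking).

```python
import numpy as np
from scipy.sparse import csr_matrix


def top_k_terms(X, terms, doc_index, k=None):
    """Return [(term, score), ...] for one row, highest score first (hypothetical helper)."""
    # indptr[i]:indptr[i + 1] delimits row i inside indices/data.
    start, end = X.indptr[doc_index], X.indptr[doc_index + 1]
    cols = X.indices[start:end]
    vals = X.data[start:end]
    pairs = sorted(
        ((terms[c], float(v)) for c, v in zip(cols, vals) if v != 0.0),
        key=lambda p: (-p[1], p[0]),  # score descending, then term ascending
    )
    return pairs if k is None else pairs[:k]


# Assumed toy matrix: 2 documents, 3 terms; the second document is empty.
X = csr_matrix(np.array([[0.0, 2.0, 1.0], [0.0, 0.0, 0.0]]))
terms = ["a", "b", "c"]
all_terms = top_k_terms(X, terms, 0)     # [('b', 2.0), ('c', 1.0)]
best = top_k_terms(X, terms, 0, k=1)     # [('b', 2.0)]
empty = top_k_terms(X, terms, 1)         # []
```

The cost is $O(r \log r)$ for a row with $r$ non-zeros; for small `k` on very long rows, `heapq.nlargest` would avoid the full sort.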

Follow-up Questions

- How would you scale this to a corpus that does not fit in memory (tens of millions of documents)? Discuss the hashing trick (feature hashing) as an alternative to an explicit vocabulary, and the time/space trade-offs of dropping `inverse_transform`.
- The current `transform` requires the corpus to be re-iterable. How would you restructure `fit_transform` to make a single streaming pass, and what does that cost in peak memory?
- TF–IDF treats terms independently and ignores order. What would change if you needed n-grams (bigrams/trigrams), and how does that affect vocabulary size and sparsity?
- How would you compute cosine similarity between two documents directly from the CSR matrix, and why does L2 normalization make that a simple dot product?
- The smoothed IDF gives a term appearing in every document an IDF of $\log\!\big(\tfrac{1+N}{1+N}\big) + 1 = 1$, not $0$. Why might you prefer that over an unsmoothed IDF that zeroes such terms out?
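For the cosine-similarity follow-up, one possible sketch is shown below. Cosine similarity is $\frac{x \cdot y}{\lVert x \rVert \lVert y \rVert}$, so once every row has unit L2 norm the denominator is 1 and a sparse matrix product yields all pairwise similarities. The helper name `l2_normalize_rows` and the toy matrix are assumptions for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags


def l2_normalize_rows(X):
    """Scale each non-zero row of a CSR matrix to unit L2 norm (hypothetical helper)."""
    norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
    norms[norms == 0.0] = 1.0  # leave all-zero rows untouched
    return (diags(1.0 / norms) @ X).tocsr()


# Assumed toy matrix: rows 0 and 1 are parallel, row 2 is orthogonal, row 3 is empty.
X = csr_matrix(
    np.array(
        [
            [1.0, 2.0, 0.0],
            [2.0, 4.0, 0.0],
            [0.0, 0.0, 3.0],
            [0.0, 0.0, 0.0],
        ]
    )
)
Xn = l2_normalize_rows(X)

# With unit-norm rows, the Gram matrix holds every pairwise cosine similarity.
S = (Xn @ Xn.T).toarray()
```

Here `S[0, 1]` is 1 (parallel rows), `S[0, 2]` is 0 (no shared terms), and the empty document has similarity 0 with everything, including itself.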
