Implement a bag-of-words–based text similarity search engine from scratch. Write code that:
- tokenizes text (lowercasing, punctuation handling, Unicode support, and optional stopword removal or stemming; justify your choices),
- builds document vectors using term frequency and supports TF–IDF weighting,
- computes similarity scores (implement cosine similarity; optionally compare with Jaccard), and
- returns the top-k most similar document IDs for a given query, along with their scores.

Clearly define each function's purpose, inputs, and outputs, and provide a short example demonstrating end-to-end usage. Analyze the time and space complexity of indexing and querying, and briefly discuss how you would scale to large corpora (e.g., inverted index, pruning, or approximate search).
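
As a starting point, the four requirements above could be sketched as follows. This is a minimal illustration under stated assumptions, not a reference solution: the names `tokenize`, `build_index`, `cosine`, and `search`, the tiny `STOPWORDS` set, and the smoothed IDF formula `log((1 + N) / (1 + df)) + 1` are all choices made for this sketch. It omits stemming and scores every document by brute force, so a query costs time proportional to the total number of stored (document, term) weights; an inverted index would restrict scoring to documents that share at least one term with the query.

```python
import heapq
import math
import re
import unicodedata
from collections import Counter

# Assumption: a tiny illustrative stopword list, not a standard one.
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "on", "in", "is", "to"})
# \w matches Unicode letters, digits, and underscore for str patterns.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text, remove_stopwords=True):
    """Input: raw string. Output: list of case-folded word tokens."""
    text = unicodedata.normalize("NFKC", text).casefold()
    tokens = TOKEN_RE.findall(text)  # punctuation is dropped here
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


def build_index(docs):
    """Input: dict doc_id -> text. Output: (idf, vectors).

    idf maps term -> weight; vectors maps doc_id -> sparse TF-IDF dict.
    """
    tfs = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    df = Counter()
    for tf in tfs.values():
        df.update(tf.keys())  # each document counts once per term
    n = len(docs)
    # Assumption: smoothed IDF so that no term gets a zero weight.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = {
        doc_id: {t: count * idf[t] for t, count in tf.items()}
        for doc_id, tf in tfs.items()
    }
    return idf, vectors


def cosine(u, v):
    """Input: two sparse vectors (dicts). Output: cosine similarity."""
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    return dot / (norm_u * norm_v)


def search(query, idf, vectors, k=3):
    """Output: up to k (doc_id, score) pairs, best first, zero scores omitted."""
    q_tf = Counter(tokenize(query))
    # Query terms never seen during indexing are ignored.
    q_vec = {t: c * idf[t] for t, c in q_tf.items() if t in idf}
    scored = ((doc_id, cosine(q_vec, vec)) for doc_id, vec in vectors.items())
    hits = [(doc_id, s) for doc_id, s in scored if s > 0.0]
    return heapq.nlargest(k, hits, key=lambda pair: pair[1])


# End-to-end usage on a toy corpus (hypothetical documents).
docs = {
    "d1": "The cat sat on the mat.",
    "d2": "A dog chased the cat!",
    "d3": "Stock markets fell sharply today.",
}
idf, vectors = build_index(docs)
for doc_id, score in search("cat on a mat", idf, vectors, k=2):
    print(doc_id, round(score, 3))
```

In this sketch, indexing takes time and space linear in the total number of tokens in the corpus, and `heapq.nlargest` keeps top-k selection at O(N log k) over N scored documents. Precomputing and caching each document's norm at index time would remove the repeated norm calculation inside `cosine`.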