Implement bag-of-words similarity search from scratch
Company: Apple
Role: Machine Learning Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
Implement a bag-of-words–based text similarity search engine from scratch. Write code that:
(1) tokenizes text (lowercasing, punctuation handling, Unicode support, and optional stopword removal/stemming; justify your choices),
(2) builds document vectors using term frequency and supports TF–IDF weighting,
(3) computes similarity scores (implement cosine similarity; optionally compare with Jaccard), and
(4) returns the top-k most similar document IDs for a given query along with their scores.
Clearly define each function’s purpose and inputs/outputs, and provide a short example demonstrating end-to-end usage. Analyze time and space complexity for indexing and querying, and briefly discuss how you would scale to large corpora (e.g., inverted index, pruning, or approximate search).
Quick Answer: This question evaluates knowledge and implementation skills in text processing, vector-space models, and information retrieval, covering tokenization, term-frequency and TF–IDF weighting, cosine similarity, and ranking for similarity search within the Coding & Algorithms domain.
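One possible shape for an answer is sketched below in Python, using only the standard library. It is a minimal illustration under stated assumptions, not a reference solution: the names `tokenize`, `BowIndex`, `jaccard`, and `STOPWORDS` are hypothetical, the stopword list is a tiny placeholder, stemming is omitted, the IDF uses a smoothed form log((1 + N) / (1 + df)) + 1 chosen here for convenience, and query terms that never appear in the corpus are dropped.

```python
import heapq
import math
import re
import unicodedata
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "is", "in", "on", "to"})
_TOKEN_RE = re.compile(r"\w+")  # \w is Unicode-aware for str patterns in Python 3


def tokenize(text, remove_stopwords=True):
    """Input: raw string. Output: list of normalized word tokens."""
    # NFKC folds compatibility forms; casefold() is a Unicode-aware lowercase.
    text = unicodedata.normalize("NFKC", text).casefold()
    tokens = _TOKEN_RE.findall(text)  # punctuation is discarded here
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


def jaccard(a_tokens, b_tokens):
    """Input: two token lists. Output: set-overlap similarity in [0, 1]."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if (a or b) else 0.0


class BowIndex:
    """Inverted index over TF or TF-IDF weighted bag-of-words vectors."""

    def __init__(self, docs, use_idf=True):
        """Input: dict mapping doc_id -> raw text."""
        self.use_idf = use_idf
        counts_by_doc = {d: Counter(tokenize(t)) for d, t in docs.items()}
        df = Counter()
        for counts in counts_by_doc.values():
            df.update(counts.keys())  # document frequency per term
        n = len(docs)
        # Smoothed IDF (an assumed variant); always positive.
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
        self.postings = defaultdict(list)  # term -> [(doc_id, weight)]
        self.norms = {}  # doc_id -> Euclidean norm of its vector
        for doc_id, counts in counts_by_doc.items():
            vec = self._weight(counts)
            self.norms[doc_id] = math.sqrt(sum(w * w for w in vec.values()))
            for term, w in vec.items():
                self.postings[term].append((doc_id, w))

    def _weight(self, counts):
        """Input: term counts. Output: sparse weighted vector (known terms only)."""
        return {
            t: c * (self.idf[t] if self.use_idf else 1.0)
            for t, c in counts.items()
            if t in self.idf
        }

    def search(self, query, k=3):
        """Input: query string and k. Output: up to k (doc_id, cosine) pairs, best first."""
        q_vec = self._weight(Counter(tokenize(query)))
        q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
        if q_norm == 0.0:
            return []
        dots = defaultdict(float)
        # Only documents sharing at least one term with the query are touched.
        for term, q_w in q_vec.items():
            for doc_id, d_w in self.postings.get(term, ()):
                dots[doc_id] += q_w * d_w
        scored = ((d, dot / (q_norm * self.norms[d])) for d, dot in dots.items())
        return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# End-to-end usage on a toy corpus.
docs = {
    "d1": "The cat sat on the mat.",
    "d2": "Dogs chase cats in the park.",
    "d3": "Stock markets fell sharply today.",
}
index = BowIndex(docs)
results = index.search("cat on a mat", k=2)
print(results)
```

In this sketch, indexing takes time proportional to the total number of tokens and space proportional to the number of distinct (term, document) pairs. A query walks only the posting lists of its own terms and then selects the top k with a heap, so its cost is roughly the summed length of those lists plus O(M log k) for M candidate documents. Because stemming is left out, "cats" in d2 does not match "cat" in the query, so only d1 is returned; adding a stemmer would trade some precision for that recall. For large corpora, the same inverted-index layout is the natural starting point for pruning low-weight postings or switching to approximate search.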