Implement retrieval and evaluation for a simple RAG
Company: Harvey
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Onsite
Quick Answer: This question evaluates understanding of embedding-based retrieval, cosine similarity ranking, and retrieval evaluation metrics (Recall@10 and MRR@10), along with efficient NumPy/Pandas-based data processing for vector search in the Coding & Algorithms domain.
Part 1: Build Document Embeddings
Constraints
- 0 <= len(docs) <= 10000
- Each `doc_id` is a unique string.
- 0 <= len(token_embeddings) <= 50000
- All vectors in `token_embeddings` have the same length `d`, where 0 <= d <= 300.
- Unknown tokens are ignored.
- Repeated known tokens are counted repeatedly in the average.
Examples
Input: ([["d1", "Cat sat cat."], ["d2", "dog ran"], ["d3", "unknown"]], {"cat": [1, 0], "sat": [0, 2], "dog": [0, 1], "ran": [2, 2]})
Expected Output: {"d1": [0.6666666666666666, 0.6666666666666666], "d2": [1.0, 1.5], "d3": [0.0, 0.0]}
Explanation: `d1` uses cat, sat, cat, so its average is `([1,0]+[0,2]+[1,0])/3`.
Input: ([["a", "Hello, HELLO! 123"], ["b", ""]], {"hello": [2, 4], "123": [1, 1]})
Expected Output: {"a": [1.6666666666666667, 3.0], "b": [0.0, 0.0]}
Explanation: Tokenization is case-insensitive and keeps alphanumeric runs. Empty text produces a zero vector.
Input: ([], {"word": [3, 5]})
Expected Output: {}
Explanation: There are no documents to embed.
Input: ([["only", "nothing matches"]], {"known": [1, 2, 3]})
Expected Output: {"only": [0.0, 0.0, 0.0]}
Explanation: The document contains no known tokens, so it receives the zero vector.
Input: ([["d1", "anything"], ["d2", ""]], {})
Expected Output: {"d1": [], "d2": []}
Explanation: With no token embeddings, the embedding dimension is 0.
Hints
- Compute the embedding dimension from any vector in `token_embeddings`; if there is no vector, the dimension is 0.
- For each document, maintain a running sum vector and a count of known tokens, then divide at the end.
Part 2: Retrieve Top-k Documents by Cosine Similarity
Constraints
- 0 <= len(doc_ids) <= 10000
- 0 <= len(query_ids) <= 1000
- len(doc_ids) == len(doc_vectors)
- len(query_ids) == len(query_vectors)
- All vectors have the same dimension d, where 0 <= d <= 1000.
- 0 <= k <= 10000
- If k is larger than the number of documents, return all documents for each query.
Examples
Input: (["d1", "d2", "d3"], [[1, 0], [0, 1], [1, 1]], ["q1", "q2"], [[1, 0], [0, 2]], 2)
Expected Output: {"q1": ["d1", "d3"], "q2": ["d2", "d3"]}
Explanation: For `q1`, `d1` has cosine 1 and `d3` has cosine about 0.707. For `q2`, `d2` ranks first.
Input: (["a", "b", "c"], [[0, 0], [1, 0], [0, 1]], ["z"], [[0, 0]], 3)
Expected Output: {"z": ["a", "b", "c"]}
Explanation: The query vector has zero norm, so all similarities are 0. Ties follow original document order.
Input: (["x", "y"], [[1, 1], [-1, -1]], ["q"], [[1, 1]], 5)
Expected Output: {"q": ["x", "y"]}
Explanation: `k` is larger than the document count, so both documents are returned.
Input: ([], [], ["q1"], [[1, 0]], 10)
Expected Output: {"q1": []}
Explanation: There are no documents to retrieve.
Input: (["d1", "d2"], [[1, 0], [0, 1]], [], [], 1)
Expected Output: {}
Explanation: There are no queries, so the result is empty.
Hints
- Precompute document norms once, since they are reused for every query.
- To rank documents, sort by `(-similarity, original_document_index)` or use a heap with the same key.
Part 3: Evaluate Retrieval with Recall@k and MRR@k
Constraints
- 0 <= number of retrieved queries <= 100000
- 0 <= k <= 100000
- Document IDs and query IDs are strings.
- Each retrieved ranking may contain zero or more document IDs.
- A query with no relevant documents contributes 0 to both metrics.
- If there are no retrieved queries, return 0.0 for both metrics.
Examples
Input: ({"q1": ["d3", "d2", "d1"], "q2": ["d4", "d5"]}, {"q1": ["d2"], "q2": ["d9"]}, 3)
Expected Output: {"recall_at_k": 0.5, "mrr_at_k": 0.25}
Explanation: `q1` has its first relevant document at rank 2, contributing recall 1 and reciprocal rank 1/2. `q2` has no hit.
Input: ({"q1": ["a", "b", "c"], "q2": ["x", "y"]}, {"q1": ["c"], "q2": ["x"]}, 2)
Expected Output: {"recall_at_k": 0.5, "mrr_at_k": 0.5}
Explanation: `q1`'s relevant document is at rank 3, outside top 2. `q2` hits at rank 1.
Input: ({"q": ["a", "b", "a", "c"]}, {"q": ["c", "a"]}, 4)
Expected Output: {"recall_at_k": 1.0, "mrr_at_k": 1.0}
Explanation: The first retrieved document is relevant, so the reciprocal rank is 1.
Input: ({}, {"q1": ["d1"]}, 10)
Expected Output: {"recall_at_k": 0.0, "mrr_at_k": 0.0}
Explanation: No queries are evaluated.
Input: ({"q1": ["d1"], "q2": []}, {"q1": ["d1"]}, 0)
Expected Output: {"recall_at_k": 0.0, "mrr_at_k": 0.0}
Explanation: With k = 0, no documents are considered for any query.
Hints
- Convert each query's relevant document list to a set before scanning its top-k results.
- Stop scanning a query as soon as you find the first relevant retrieved document.