How do I practice coding and algorithm questions?

Use PracHub's coding console to write, test, and debug your solutions in Python or JavaScript. View hints, test against sample inputs, and compare with official solutions.

What difficulty level is this coding question?

This is a medium difficulty Coding & Algorithms question, commonly asked during Onsite rounds at Harvey.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Harvey during technical interviews.

Implement retrieval and evaluation for a simple RAG

Quick Overview

This question evaluates understanding of embedding-based retrieval, cosine similarity ranking, and retrieval evaluation metrics (Recall@10 and MRR@10), along with efficient NumPy/Pandas-based data processing for vector search in the Coding & Algorithms domain.

Implement retrieval and evaluation for a simple RAG

Company: Harvey

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Onsite

## Task You are given a small “toy RAG” notebook setup with helper utilities. You have a corpus of text passages stored in a **pandas DataFrame** `docs_df` with columns: - `doc_id` (string/int) - `text` (string) You are also given: - A utility function `embed_texts(texts: List[str]) -> np.ndarray` that returns a 2D array of embeddings with shape `(n, d)`. - A DataFrame `queries_df` with columns: - `query_id` - `query_text` - `relevant_doc_ids` (a list of doc_ids that are considered relevant for that query) ### Part A — Build document embeddings Compute embeddings for every row in `docs_df` and store them so they can be used for retrieval. ### Part B — Retrieve top-k by cosine similarity For each query: 1. Compute the query embedding. 2. Compute **cosine similarity** between the query embedding and all document embeddings. 3. Return the **top 10** documents ranked by similarity. Your output per query should include: - `query_id` - a ranked list of the 10 retrieved `doc_id`s - optionally the similarity scores ### Part C — Evaluate retrieval Implement evaluation over all queries using: - **Recall@10**: fraction of queries where at least one relevant document appears in the top 10. - **MRR@10**: mean reciprocal rank of the first relevant document within the top 10 (0 if none). ## Constraints / Notes - Assume `docs_df` can be large enough that you should avoid unnecessary Python loops over documents. - Embeddings may or may not be pre-normalized; handle cosine similarity correctly. - You may use NumPy/Pandas only (no external vector databases). Describe your approach and implement the missing pieces (as if filling blanks in a notebook).

Quick Answer: This question evaluates understanding of embedding-based retrieval, cosine similarity ranking, and retrieval evaluation metrics (Recall@10 and MRR@10), along with efficient NumPy/Pandas-based data processing for vector search in the Coding & Algorithms domain.

Part 1: Build Document Embeddings

You are given a collection of documents and a small token embedding table. Implement the document embedding step for a toy retrieval system. Each document has a unique string `doc_id` and a `text` field. Each known token maps to a numeric vector of length `d`. To embed a document: 1. Tokenize the document text into lowercase alphanumeric tokens. A token is any maximal contiguous sequence matching `[a-z0-9]+` after lowercasing. 2. Ignore tokens that are not present in `token_embeddings`. 3. Average the vectors of all known tokens that appear in the document. Repeated tokens count multiple times. 4. If a document has no known tokens, its embedding is the zero vector of length `d`. 5. If `token_embeddings` is empty, use dimension `d = 0`, so every embedding is an empty list. Return a dictionary mapping each `doc_id` to its computed embedding.

Constraints

0 <= len(docs) <= 10000
Each `doc_id` is a unique string.
0 <= len(token_embeddings) <= 50000
All vectors in `token_embeddings` have the same length `d`, where 0 <= d <= 300.
Unknown tokens are ignored.
Repeated known tokens are counted repeatedly in the average.

Examples

Input: ([["d1", "Cat sat cat."], ["d2", "dog ran"], ["d3", "unknown"]], {"cat": [1, 0], "sat": [0, 2], "dog": [0, 1], "ran": [2, 2]})

Expected Output: {"d1": [0.6666666666666666, 0.6666666666666666], "d2": [1.0, 1.5], "d3": [0.0, 0.0]}

Explanation: `d1` uses cat, sat, cat, so its average is `([1,0]+[0,2]+[1,0])/3`.

Input: ([["a", "Hello, HELLO! 123"], ["b", ""]], {"hello": [2, 4], "123": [1, 1]})

Expected Output: {"a": [1.6666666666666667, 3.0], "b": [0.0, 0.0]}

Explanation: Tokenization is case-insensitive and keeps alphanumeric runs. Empty text produces a zero vector.

Input: ([], {"word": [3, 5]})

Expected Output: {}

Explanation: There are no documents to embed.

Input: ([["only", "nothing matches"]], {"known": [1, 2, 3]})

Expected Output: {"only": [0.0, 0.0, 0.0]}

Explanation: The document contains no known tokens, so it receives the zero vector.

Input: ([["d1", "anything"], ["d2", ""]], {})

Expected Output: {"d1": [], "d2": []}

Explanation: With no token embeddings, the embedding dimension is 0.

Hints

Compute the embedding dimension from any vector in `token_embeddings`; if there is no vector, the dimension is 0.
For each document, maintain a running sum vector and a count of known tokens, then divide at the end.

Part 2: Retrieve Top-k Documents by Cosine Similarity

You are given precomputed document embeddings and query embeddings. Implement the retrieval step for a toy search system. For each query, compute the cosine similarity between the query vector and every document vector. Return the top `k` document IDs sorted by decreasing similarity. Cosine similarity between vectors `a` and `b` is: `dot(a, b) / (norm(a) * norm(b))` If either vector has norm 0, define its cosine similarity with the other vector as 0. Tie-breaking rule: if two documents have the same similarity for a query, the document that appears earlier in `doc_ids` ranks first.

Constraints

0 <= len(doc_ids) <= 10000
0 <= len(query_ids) <= 1000
len(doc_ids) == len(doc_vectors)
len(query_ids) == len(query_vectors)
All vectors have the same dimension d, where 0 <= d <= 1000.
0 <= k <= 10000
If k is larger than the number of documents, return all documents for each query.

Examples

Input: (["d1", "d2", "d3"], [[1, 0], [0, 1], [1, 1]], ["q1", "q2"], [[1, 0], [0, 2]], 2)

Expected Output: {"q1": ["d1", "d3"], "q2": ["d2", "d3"]}

Explanation: For `q1`, `d1` has cosine 1 and `d3` has cosine about 0.707. For `q2`, `d2` ranks first.

Input: (["a", "b", "c"], [[0, 0], [1, 0], [0, 1]], ["z"], [[0, 0]], 3)

Expected Output: {"z": ["a", "b", "c"]}

Explanation: The query vector has zero norm, so all similarities are 0. Ties follow original document order.

Input: (["x", "y"], [[1, 1], [-1, -1]], ["q"], [[1, 1]], 5)

Expected Output: {"q": ["x", "y"]}

Explanation: `k` is larger than the document count, so both documents are returned.

Input: ([], [], ["q1"], [[1, 0]], 10)

Expected Output: {"q1": []}

Explanation: There are no documents to retrieve.

Input: (["d1", "d2"], [[1, 0], [0, 1]], [], [], 1)

Expected Output: {}

Explanation: There are no queries, so the result is empty.

Hints

Precompute document norms once, since they are reused for every query.
To rank documents, sort by `(-similarity, original_document_index)` or use a heap with the same key.

Part 3: Evaluate Retrieval with Recall@k and MRR@k

You are given retrieval results for multiple queries and a set of relevant documents for each query. Implement two common retrieval evaluation metrics: 1. `Recall@k`: the fraction of evaluated queries where at least one relevant document appears in the top `k` retrieved documents. 2. `MRR@k`: mean reciprocal rank of the first relevant document within the top `k`. A query contributes `1 / rank` if its first relevant document appears at 1-based rank `rank`; otherwise it contributes 0. Evaluate only the query IDs present in `retrieved`. If a query is missing from `relevant`, treat it as having no relevant documents. Extra query IDs in `relevant` are ignored.

Constraints

0 <= number of retrieved queries <= 100000
0 <= k <= 100000
Document IDs and query IDs are strings.
Each retrieved ranking may contain zero or more document IDs.
A query with no relevant documents contributes 0 to both metrics.
If there are no retrieved queries, return 0.0 for both metrics.

Examples

Input: ({"q1": ["d3", "d2", "d1"], "q2": ["d4", "d5"]}, {"q1": ["d2"], "q2": ["d9"]}, 3)

Expected Output: {"recall_at_k": 0.5, "mrr_at_k": 0.25}

Explanation: `q1` has its first relevant document at rank 2, contributing recall 1 and reciprocal rank 1/2. `q2` has no hit.

Input: ({"q1": ["a", "b", "c"], "q2": ["x", "y"]}, {"q1": ["c"], "q2": ["x"]}, 2)

Expected Output: {"recall_at_k": 0.5, "mrr_at_k": 0.5}

Explanation: `q1`'s relevant document is at rank 3, outside top 2. `q2` hits at rank 1.

Input: ({"q": ["a", "b", "a", "c"]}, {"q": ["c", "a"]}, 4)

Expected Output: {"recall_at_k": 1.0, "mrr_at_k": 1.0}

Explanation: The first retrieved document is relevant, so the reciprocal rank is 1.

Input: ({}, {"q1": ["d1"]}, 10)

Expected Output: {"recall_at_k": 0.0, "mrr_at_k": 0.0}

Explanation: No queries are evaluated.

Input: ({"q1": ["d1"], "q2": []}, {"q1": ["d1"]}, 0)

Expected Output: {"recall_at_k": 0.0, "mrr_at_k": 0.0}

Explanation: With k = 0, no documents are considered for any query.

Hints

Convert each query's relevant document list to a set before scanning its top-k results.
Stop scanning a query as soon as you find the first relevant retrieved document.

Quick Overview

Implement retrieval and evaluation for a simple RAG

Company: Harvey

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Onsite

Part 1: Build Document Embeddings

Constraints

0 <= len(docs) <= 10000
Each `doc_id` is a unique string.
0 <= len(token_embeddings) <= 50000
All vectors in `token_embeddings` have the same length `d`, where 0 <= d <= 300.
Unknown tokens are ignored.
Repeated known tokens are counted repeatedly in the average.

Examples

Input: ([["d1", "Cat sat cat."], ["d2", "dog ran"], ["d3", "unknown"]], {"cat": [1, 0], "sat": [0, 2], "dog": [0, 1], "ran": [2, 2]})

Expected Output: {"d1": [0.6666666666666666, 0.6666666666666666], "d2": [1.0, 1.5], "d3": [0.0, 0.0]}

Explanation: `d1` uses cat, sat, cat, so its average is `([1,0]+[0,2]+[1,0])/3`.

Input: ([["a", "Hello, HELLO! 123"], ["b", ""]], {"hello": [2, 4], "123": [1, 1]})

Expected Output: {"a": [1.6666666666666667, 3.0], "b": [0.0, 0.0]}

Explanation: Tokenization is case-insensitive and keeps alphanumeric runs. Empty text produces a zero vector.

Input: ([], {"word": [3, 5]})

Expected Output: {}

Explanation: There are no documents to embed.

Input: ([["only", "nothing matches"]], {"known": [1, 2, 3]})

Expected Output: {"only": [0.0, 0.0, 0.0]}

Explanation: The document contains no known tokens, so it receives the zero vector.

Input: ([["d1", "anything"], ["d2", ""]], {})

Expected Output: {"d1": [], "d2": []}

Explanation: With no token embeddings, the embedding dimension is 0.

Hints

Compute the embedding dimension from any vector in `token_embeddings`; if there is no vector, the dimension is 0.
For each document, maintain a running sum vector and a count of known tokens, then divide at the end.

Part 2: Retrieve Top-k Documents by Cosine Similarity

Constraints

0 <= len(doc_ids) <= 10000
0 <= len(query_ids) <= 1000
len(doc_ids) == len(doc_vectors)
len(query_ids) == len(query_vectors)
All vectors have the same dimension d, where 0 <= d <= 1000.
0 <= k <= 10000
If k is larger than the number of documents, return all documents for each query.

Examples

Input: (["d1", "d2", "d3"], [[1, 0], [0, 1], [1, 1]], ["q1", "q2"], [[1, 0], [0, 2]], 2)

Expected Output: {"q1": ["d1", "d3"], "q2": ["d2", "d3"]}

Explanation: For `q1`, `d1` has cosine 1 and `d3` has cosine about 0.707. For `q2`, `d2` ranks first.

Input: (["a", "b", "c"], [[0, 0], [1, 0], [0, 1]], ["z"], [[0, 0]], 3)

Expected Output: {"z": ["a", "b", "c"]}

Explanation: The query vector has zero norm, so all similarities are 0. Ties follow original document order.

Input: (["x", "y"], [[1, 1], [-1, -1]], ["q"], [[1, 1]], 5)

Expected Output: {"q": ["x", "y"]}

Explanation: `k` is larger than the document count, so both documents are returned.

Input: ([], [], ["q1"], [[1, 0]], 10)

Expected Output: {"q1": []}

Explanation: There are no documents to retrieve.

Input: (["d1", "d2"], [[1, 0], [0, 1]], [], [], 1)

Expected Output: {}

Explanation: There are no queries, so the result is empty.

Hints

Precompute document norms once, since they are reused for every query.
To rank documents, sort by `(-similarity, original_document_index)` or use a heap with the same key.

Part 3: Evaluate Retrieval with Recall@k and MRR@k

Constraints

0 <= number of retrieved queries <= 100000
0 <= k <= 100000
Document IDs and query IDs are strings.
Each retrieved ranking may contain zero or more document IDs.
A query with no relevant documents contributes 0 to both metrics.
If there are no retrieved queries, return 0.0 for both metrics.

Examples

Input: ({"q1": ["d3", "d2", "d1"], "q2": ["d4", "d5"]}, {"q1": ["d2"], "q2": ["d9"]}, 3)

Expected Output: {"recall_at_k": 0.5, "mrr_at_k": 0.25}

Explanation: `q1` has its first relevant document at rank 2, contributing recall 1 and reciprocal rank 1/2. `q2` has no hit.

Input: ({"q1": ["a", "b", "c"], "q2": ["x", "y"]}, {"q1": ["c"], "q2": ["x"]}, 2)

Expected Output: {"recall_at_k": 0.5, "mrr_at_k": 0.5}

Explanation: `q1`'s relevant document is at rank 3, outside top 2. `q2` hits at rank 1.

Input: ({"q": ["a", "b", "a", "c"]}, {"q": ["c", "a"]}, 4)

Expected Output: {"recall_at_k": 1.0, "mrr_at_k": 1.0}

Explanation: The first retrieved document is relevant, so the reciprocal rank is 1.

Input: ({}, {"q1": ["d1"]}, 10)

Expected Output: {"recall_at_k": 0.0, "mrr_at_k": 0.0}

Explanation: No queries are evaluated.

Input: ({"q1": ["d1"], "q2": []}, {"q1": ["d1"]}, 0)

Expected Output: {"recall_at_k": 0.0, "mrr_at_k": 0.0}

Explanation: With k = 0, no documents are considered for any query.

Hints

Convert each query's relevant document list to a set before scanning its top-k results.
Stop scanning a query as soon as you find the first relevant retrieved document.