PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates understanding of embedding-based retrieval, cosine similarity ranking, and retrieval evaluation metrics (Recall@10 and MRR@10), along with efficient NumPy/Pandas-based data processing for vector search in the Coding & Algorithms domain.

  • medium
  • Harvey
  • Coding & Algorithms
  • Software Engineer

Implement retrieval and evaluation for a simple RAG

Company: Harvey

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Onsite

## Task You are given a small “toy RAG” notebook setup with helper utilities. You have a corpus of text passages stored in a **pandas DataFrame** `docs_df` with columns: - `doc_id` (string/int) - `text` (string) You are also given: - A utility function `embed_texts(texts: List[str]) -> np.ndarray` that returns a 2D array of embeddings with shape `(n, d)`. - A DataFrame `queries_df` with columns: - `query_id` - `query_text` - `relevant_doc_ids` (a list of doc_ids that are considered relevant for that query) ### Part A — Build document embeddings Compute embeddings for every row in `docs_df` and store them so they can be used for retrieval. ### Part B — Retrieve top-k by cosine similarity For each query: 1. Compute the query embedding. 2. Compute **cosine similarity** between the query embedding and all document embeddings. 3. Return the **top 10** documents ranked by similarity. Your output per query should include: - `query_id` - a ranked list of the 10 retrieved `doc_id`s - optionally the similarity scores ### Part C — Evaluate retrieval Implement evaluation over all queries using: - **Recall@10**: fraction of queries where at least one relevant document appears in the top 10. - **MRR@10**: mean reciprocal rank of the first relevant document within the top 10 (0 if none). ## Constraints / Notes - Assume `docs_df` can be large enough that you should avoid unnecessary Python loops over documents. - Embeddings may or may not be pre-normalized; handle cosine similarity correctly. - You may use NumPy/Pandas only (no external vector databases). Describe your approach and implement the missing pieces (as if filling blanks in a notebook).

Quick Answer: This question evaluates understanding of embedding-based retrieval, cosine similarity ranking, and retrieval evaluation metrics (Recall@10 and MRR@10), along with efficient NumPy/Pandas-based data processing for vector search in the Coding & Algorithms domain.

Part 1: Build Document Embeddings

You are given a collection of documents and a small token embedding table. Implement the document embedding step for a toy retrieval system. Each document has a unique string `doc_id` and a `text` field. Each known token maps to a numeric vector of length `d`. To embed a document: 1. Tokenize the document text into lowercase alphanumeric tokens. A token is any maximal contiguous sequence matching `[a-z0-9]+` after lowercasing. 2. Ignore tokens that are not present in `token_embeddings`. 3. Average the vectors of all known tokens that appear in the document. Repeated tokens count multiple times. 4. If a document has no known tokens, its embedding is the zero vector of length `d`. 5. If `token_embeddings` is empty, use dimension `d = 0`, so every embedding is an empty list. Return a dictionary mapping each `doc_id` to its computed embedding.

Constraints

  • 0 <= len(docs) <= 10000
  • Each `doc_id` is a unique string.
  • 0 <= len(token_embeddings) <= 50000
  • All vectors in `token_embeddings` have the same length `d`, where 0 <= d <= 300.
  • Unknown tokens are ignored.
  • Repeated known tokens are counted repeatedly in the average.

Examples

Input: ([["d1", "Cat sat cat."], ["d2", "dog ran"], ["d3", "unknown"]], {"cat": [1, 0], "sat": [0, 2], "dog": [0, 1], "ran": [2, 2]})

Expected Output: {"d1": [0.6666666666666666, 0.6666666666666666], "d2": [1.0, 1.5], "d3": [0.0, 0.0]}

Explanation: `d1` uses cat, sat, cat, so its average is `([1,0]+[0,2]+[1,0])/3`.

Input: ([["a", "Hello, HELLO! 123"], ["b", ""]], {"hello": [2, 4], "123": [1, 1]})

Expected Output: {"a": [1.6666666666666667, 3.0], "b": [0.0, 0.0]}

Explanation: Tokenization is case-insensitive and keeps alphanumeric runs. Empty text produces a zero vector.

Input: ([], {"word": [3, 5]})

Expected Output: {}

Explanation: There are no documents to embed.

Input: ([["only", "nothing matches"]], {"known": [1, 2, 3]})

Expected Output: {"only": [0.0, 0.0, 0.0]}

Explanation: The document contains no known tokens, so it receives the zero vector.

Input: ([["d1", "anything"], ["d2", ""]], {})

Expected Output: {"d1": [], "d2": []}

Explanation: With no token embeddings, the embedding dimension is 0.

Hints

  1. Compute the embedding dimension from any vector in `token_embeddings`; if there is no vector, the dimension is 0.
  2. For each document, maintain a running sum vector and a count of known tokens, then divide at the end.

Part 2: Retrieve Top-k Documents by Cosine Similarity

You are given precomputed document embeddings and query embeddings. Implement the retrieval step for a toy search system. For each query, compute the cosine similarity between the query vector and every document vector. Return the top `k` document IDs sorted by decreasing similarity. Cosine similarity between vectors `a` and `b` is: `dot(a, b) / (norm(a) * norm(b))` If either vector has norm 0, define its cosine similarity with the other vector as 0. Tie-breaking rule: if two documents have the same similarity for a query, the document that appears earlier in `doc_ids` ranks first.

Constraints

  • 0 <= len(doc_ids) <= 10000
  • 0 <= len(query_ids) <= 1000
  • len(doc_ids) == len(doc_vectors)
  • len(query_ids) == len(query_vectors)
  • All vectors have the same dimension d, where 0 <= d <= 1000.
  • 0 <= k <= 10000
  • If k is larger than the number of documents, return all documents for each query.

Examples

Input: (["d1", "d2", "d3"], [[1, 0], [0, 1], [1, 1]], ["q1", "q2"], [[1, 0], [0, 2]], 2)

Expected Output: {"q1": ["d1", "d3"], "q2": ["d2", "d3"]}

Explanation: For `q1`, `d1` has cosine 1 and `d3` has cosine about 0.707. For `q2`, `d2` ranks first.

Input: (["a", "b", "c"], [[0, 0], [1, 0], [0, 1]], ["z"], [[0, 0]], 3)

Expected Output: {"z": ["a", "b", "c"]}

Explanation: The query vector has zero norm, so all similarities are 0. Ties follow original document order.

Input: (["x", "y"], [[1, 1], [-1, -1]], ["q"], [[1, 1]], 5)

Expected Output: {"q": ["x", "y"]}

Explanation: `k` is larger than the document count, so both documents are returned.

Input: ([], [], ["q1"], [[1, 0]], 10)

Expected Output: {"q1": []}

Explanation: There are no documents to retrieve.

Input: (["d1", "d2"], [[1, 0], [0, 1]], [], [], 1)

Expected Output: {}

Explanation: There are no queries, so the result is empty.

Hints

  1. Precompute document norms once, since they are reused for every query.
  2. To rank documents, sort by `(-similarity, original_document_index)` or use a heap with the same key.

Part 3: Evaluate Retrieval with Recall@k and MRR@k

You are given retrieval results for multiple queries and a set of relevant documents for each query. Implement two common retrieval evaluation metrics: 1. `Recall@k`: the fraction of evaluated queries where at least one relevant document appears in the top `k` retrieved documents. 2. `MRR@k`: mean reciprocal rank of the first relevant document within the top `k`. A query contributes `1 / rank` if its first relevant document appears at 1-based rank `rank`; otherwise it contributes 0. Evaluate only the query IDs present in `retrieved`. If a query is missing from `relevant`, treat it as having no relevant documents. Extra query IDs in `relevant` are ignored.

Constraints

  • 0 <= number of retrieved queries <= 100000
  • 0 <= k <= 100000
  • Document IDs and query IDs are strings.
  • Each retrieved ranking may contain zero or more document IDs.
  • A query with no relevant documents contributes 0 to both metrics.
  • If there are no retrieved queries, return 0.0 for both metrics.

Examples

Input: ({"q1": ["d3", "d2", "d1"], "q2": ["d4", "d5"]}, {"q1": ["d2"], "q2": ["d9"]}, 3)

Expected Output: {"recall_at_k": 0.5, "mrr_at_k": 0.25}

Explanation: `q1` has its first relevant document at rank 2, contributing recall 1 and reciprocal rank 1/2. `q2` has no hit.

Input: ({"q1": ["a", "b", "c"], "q2": ["x", "y"]}, {"q1": ["c"], "q2": ["x"]}, 2)

Expected Output: {"recall_at_k": 0.5, "mrr_at_k": 0.5}

Explanation: `q1`'s relevant document is at rank 3, outside top 2. `q2` hits at rank 1.

Input: ({"q": ["a", "b", "a", "c"]}, {"q": ["c", "a"]}, 4)

Expected Output: {"recall_at_k": 1.0, "mrr_at_k": 1.0}

Explanation: The first retrieved document is relevant, so the reciprocal rank is 1.

Input: ({}, {"q1": ["d1"]}, 10)

Expected Output: {"recall_at_k": 0.0, "mrr_at_k": 0.0}

Explanation: No queries are evaluated.

Input: ({"q1": ["d1"], "q2": []}, {"q1": ["d1"]}, 0)

Expected Output: {"recall_at_k": 0.0, "mrr_at_k": 0.0}

Explanation: With k = 0, no documents are considered for any query.

Hints

  1. Convert each query's relevant document list to a set before scanning its top-k results.
  2. Stop scanning a query as soon as you find the first relevant retrieved document.
Last updated: Jun 21, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • In-Memory File System: Create, Track Duplicates, and List - Harvey (medium)
  • Design an In-Memory Spreadsheet Engine - Harvey (hard)
  • Implement a Spreadsheet With Formulas - Harvey (medium)
  • Implement Spreadsheet Cell Updates - Harvey (hard)
  • Evaluate Symbol Expressions - Harvey (medium)