PracHub

Implement retrieval and evaluation for a simple RAG

Last updated: Mar 29, 2026

Quick Overview

This question evaluates understanding of embedding-based retrieval, cosine similarity ranking, and retrieval evaluation metrics (Recall@10 and MRR@10), along with efficient NumPy/Pandas-based data processing for vector search in the Coding & Algorithms domain.

Company: Harvey

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Onsite

Mar 1, 2026, 12:00 AM

Task

You are given a small “toy RAG” notebook setup with helper utilities.

You have a corpus of text passages stored in a pandas DataFrame docs_df with columns:

  • doc_id (string/int)
  • text (string)

You are also given:

  • A utility function embed_texts(texts: List[str]) -> np.ndarray that returns a 2D array of embeddings with shape (n, d).
  • A DataFrame queries_df with columns:
    • query_id
    • query_text
    • relevant_doc_ids (a list of doc_ids that are considered relevant for that query)

Part A — Build document embeddings

Compute embeddings for every row in docs_df and store them so they can be used for retrieval.
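One way to approach Part A is sketched below. It keeps the embeddings as a single 2D matrix whose row order matches an array of doc_id values, rather than storing one small array per DataFrame cell. The `embed_texts` defined here is only a hypothetical stand-in so the sketch runs on its own (the real utility is provided by the notebook), and `build_doc_index` and `batch_size` are illustrative names, not part of the original setup. The sketch assumes `docs_df` is non-empty.

```python
from typing import List, Tuple

import numpy as np
import pandas as pd


def embed_texts(texts: List[str]) -> np.ndarray:
    """Hypothetical stand-in for the notebook's utility: hashed bag-of-words."""
    dim = 64
    out = np.zeros((len(texts), dim), dtype=np.float32)
    for i, text in enumerate(texts):
        for token in text.lower().split():
            # Deterministic toy hash so results are repeatable across runs.
            out[i, sum(map(ord, token)) % dim] += 1.0
    return out


def build_doc_index(docs_df: pd.DataFrame, batch_size: int = 256) -> Tuple[np.ndarray, np.ndarray]:
    """Return (doc_ids, doc_emb), where row i of doc_emb belongs to doc_ids[i]."""
    texts = docs_df["text"].tolist()
    # Embed in batches so one call never has to hold the whole corpus at once.
    chunks = [
        embed_texts(texts[start:start + batch_size])
        for start in range(0, len(texts), batch_size)
    ]
    doc_emb = np.vstack(chunks).astype(np.float32)
    doc_ids = docs_df["doc_id"].to_numpy()
    return doc_ids, doc_emb
```

Keeping `doc_ids` and `doc_emb` aligned by position means a row index returned by the similarity search maps straight back to a doc_id.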

Part B — Retrieve top-k by cosine similarity

For each query:

  1. Compute the query embedding.
  2. Compute cosine similarity between the query embedding and all document embeddings.
  3. Return the top 10 documents ranked by similarity.

Your output per query should include:

  • query_id
  • a ranked list of the 10 retrieved doc_id values
  • optionally the similarity scores
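The retrieval steps above could be sketched as follows, assuming the query embeddings have already been computed (for example by passing the query_text column to embed_texts) and that the document matrix is row-aligned with a NumPy array of doc_id values. The names `l2_normalize` and `retrieve_top_k` are illustrative, not part of the provided helpers. The sketch normalizes both sides so a matrix product yields cosine similarity, then uses `np.argpartition` so only the k best candidates per query are fully sorted.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit length; eps guards against all-zero rows."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def retrieve_top_k(query_emb: np.ndarray, doc_emb: np.ndarray,
                   doc_ids: np.ndarray, k: int = 10):
    """Return (top_ids, top_scores), each shaped (num_queries, k), best first."""
    k = min(k, doc_emb.shape[0])  # assumes at least one document
    # Dot product of unit vectors equals cosine similarity: shape (q, n).
    sims = l2_normalize(query_emb) @ l2_normalize(doc_emb).T
    # Select the k largest scores per row without sorting all n documents.
    candidates = np.argpartition(-sims, k - 1, axis=1)[:, :k]
    candidate_scores = np.take_along_axis(sims, candidates, axis=1)
    # Sort only the k candidates in descending order of similarity.
    order = np.argsort(-candidate_scores, axis=1, kind="stable")
    top_idx = np.take_along_axis(candidates, order, axis=1)
    top_scores = np.take_along_axis(candidate_scores, order, axis=1)
    return doc_ids[top_idx], top_scores
```

Row i of the two returned arrays corresponds to the i-th query, so pairing each row with the matching query_id gives the requested per-query output.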

Part C — Evaluate retrieval

Implement evaluation over all queries using:

  • Recall@10: fraction of queries where at least one relevant document appears in the top 10 (a per-query hit rate, not the fraction of relevant documents retrieved).
  • MRR@10: mean reciprocal rank of the first relevant document within the top 10 (0 if none).
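A minimal sketch of both metrics, following the definitions given above, is shown below. It assumes the retrieval output has been collected into a DataFrame with a query_id column and a `retrieved_doc_ids` column holding the ranked list; that column name, `results_df`, and both function names are hypothetical. The only Python loops run over queries and their top-k lists, never over the full corpus.

```python
import pandas as pd


def first_relevant_rank(retrieved, relevant, k: int = 10):
    """1-based rank of the first relevant doc within the top k, or None."""
    relevant_set = set(relevant)
    for rank, doc_id in enumerate(list(retrieved)[:k], start=1):
        if doc_id in relevant_set:
            return rank
    return None


def evaluate_retrieval(results_df: pd.DataFrame, queries_df: pd.DataFrame, k: int = 10) -> dict:
    """Compute Recall@k (hit rate) and MRR@k averaged over every query."""
    retrieved_by_query = dict(zip(results_df["query_id"], results_df["retrieved_doc_ids"]))
    ranks = [
        # A query with no retrieval result counts as a miss.
        first_relevant_rank(retrieved_by_query.get(query_id, []), relevant, k)
        for query_id, relevant in zip(queries_df["query_id"], queries_df["relevant_doc_ids"])
    ]
    n = len(ranks)
    hits = sum(rank is not None for rank in ranks)
    reciprocal_sum = sum(1.0 / rank for rank in ranks if rank is not None)
    return {
        f"recall@{k}": hits / n if n else 0.0,
        f"mrr@{k}": reciprocal_sum / n if n else 0.0,
    }
```

Dividing by the total number of queries, rather than only those with a hit, is what makes a miss contribute 0 to MRR.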

Constraints / Notes

  • Assume docs_df can be large enough that you should avoid unnecessary Python loops over documents.
  • Embeddings may or may not be pre-normalized; handle cosine similarity correctly.
  • You may use NumPy/Pandas only (no external vector databases).
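The notes above suggest two practical details, illustrated in the sketch below: normalizing explicitly so the result is correct whether or not the embeddings arrive pre-normalized, and keeping document-level work inside NumPy. It normalizes the document matrix once and walks over queries in batches, so the full (queries × documents) score matrix never has to be held in memory. The function names and `batch_size` are assumptions for illustration only.

```python
import numpy as np


def normalize_rows(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Unit-normalize rows; rows that are already unit length are unchanged."""
    x = np.asarray(x, dtype=np.float32)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    # eps keeps all-zero rows at zero instead of producing NaN.
    return x / np.maximum(norms, eps)


def cosine_scores_in_batches(query_emb: np.ndarray, doc_emb: np.ndarray, batch_size: int = 128):
    """Yield (start_row, scores) where scores has shape (batch, num_docs)."""
    doc_unit = normalize_rows(doc_emb)  # normalized once, reused for every batch
    for start in range(0, len(query_emb), batch_size):
        query_unit = normalize_rows(query_emb[start:start + batch_size])
        # One matrix product per batch; no Python loop over documents.
        yield start, query_unit @ doc_unit.T
```

Because both sides are rescaled to unit length, multiplying the raw embeddings by any positive constant leaves the scores unchanged, which is the property that distinguishes cosine similarity from a plain dot product.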

Describe your approach and implement the missing pieces (as if filling blanks in a notebook).
