Task
You are given a small “toy RAG” notebook setup with helper utilities.
You have a corpus of text passages stored in a pandas DataFrame docs_df with columns:
- doc_id (string/int)
- text (string)
You are also given:
- A utility function embed_texts(texts: List[str]) -> np.ndarray that returns a 2D array of embeddings with shape (n, d).
- A DataFrame queries_df with columns:
  - query_id
  - query_text
  - relevant_doc_ids (a list of doc_ids that are considered relevant for that query)
Part A — Build document embeddings
Compute embeddings for every row in docs_df and store them so they can be used for retrieval.
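A minimal sketch of Part A. The real embed_texts is provided by the notebook; here it is stubbed with random vectors (and docs_df with toy rows) so the example is self-contained:

```python
import numpy as np
import pandas as pd

# Stub for the notebook-provided utility; the real embed_texts would call a model.
def embed_texts(texts):
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 8)).astype(np.float32)

# Hypothetical toy corpus standing in for the real docs_df.
docs_df = pd.DataFrame({
    "doc_id": ["d1", "d2", "d3"],
    "text": ["alpha", "beta", "gamma"],
})

# One batched call embeds every document; row i of doc_embeddings
# corresponds to row i of docs_df, giving shape (n_docs, d).
doc_embeddings = embed_texts(docs_df["text"].tolist())
```

Keeping the embedding matrix row-aligned with docs_df avoids any per-document bookkeeping later: a row index into the similarity scores maps straight back to a doc_id.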
Part B — Retrieve top-k by cosine similarity
For each query:
- Compute the query embedding.
- Compute cosine similarity between the query embedding and all document embeddings.
- Return the top 10 documents ranked by similarity.
Your output per query should include:
- query_id
- a ranked list of the 10 retrieved doc_ids
- optionally the similarity scores
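One way to sketch the per-query retrieval step, with hypothetical embeddings standing in for the real ones. Both sides are L2-normalized so the dot product is a correct cosine similarity whether or not the embeddings were pre-normalized:

```python
import numpy as np

# Hypothetical data: 5 document embeddings and 1 query embedding in 4 dims.
rng = np.random.default_rng(1)
doc_embeddings = rng.normal(size=(5, 4))
doc_ids = np.array(["d0", "d1", "d2", "d3", "d4"])
query_emb = rng.normal(size=(4,))

def top_k_cosine(query_emb, doc_embeddings, doc_ids, k=10):
    # Normalize rows so the dot product equals cosine similarity.
    doc_norms = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    q_norm = query_emb / np.linalg.norm(query_emb)
    sims = doc_norms @ q_norm                # shape (n_docs,)
    k = min(k, len(sims))
    top = np.argsort(-sims)[:k]              # indices of the k largest similarities
    return doc_ids[top].tolist(), sims[top].tolist()

ranked_ids, scores = top_k_cosine(query_emb, doc_embeddings, doc_ids, k=10)
```

With fewer than 10 documents the sketch simply returns them all, still ranked; the doc_id list and score list stay index-aligned for the optional score output.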
Part C — Evaluate retrieval
Implement evaluation over all queries using:
- Recall@10: fraction of queries where at least one relevant document appears in the top 10.
- MRR@10: mean reciprocal rank of the first relevant document within the top 10 (0 if none).
Constraints / Notes
- Assume docs_df can be large enough that you should avoid unnecessary Python loops over documents.
- Embeddings may or may not be pre-normalized; handle cosine similarity correctly.
- You may use NumPy/Pandas only (no external vector databases).
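The no-loops constraint suggests scoring every query against every document with a single matrix product. A minimal NumPy-only sketch with hypothetical shapes, using argpartition so only each query's top 10 is fully sorted:

```python
import numpy as np

# Hypothetical embedding matrices (the real ones come from embed_texts).
rng = np.random.default_rng(2)
doc_embeddings = rng.normal(size=(1000, 16))   # (n_docs, d)
query_embeddings = rng.normal(size=(50, 16))   # (n_queries, d)

# L2-normalize rows, then one matmul yields the full cosine-similarity
# matrix of shape (n_queries, n_docs) -- no Python loop over documents.
D = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
Q = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True)
sims = Q @ D.T

# argpartition isolates each query's 10 largest scores in O(n_docs),
# then only those 10 are sorted into rank order.
k = 10
top_unsorted = np.argpartition(-sims, k, axis=1)[:, :k]
row = np.arange(sims.shape[0])[:, None]
order = np.argsort(-sims[row, top_unsorted], axis=1)
top_k = top_unsorted[row, order]               # (n_queries, k), ranked doc indices
```

For very large corpora this trades a full per-row sort (O(n log n)) for a partition plus a sort of 10 items, which is usually the right call at this scale.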
Describe your approach and implement the missing pieces (as if filling blanks in a notebook).