Task
You are given a small “toy RAG” notebook setup with helper utilities.
You have a corpus of text passages stored in a pandas DataFrame docs_df with columns:
- doc_id (string/int)
- text (string)
You are also given:
- A utility function embed_texts(texts: List[str]) -> np.ndarray that returns a 2D array of embeddings with shape (n, d).
- A DataFrame queries_df with columns:
  - query_id
  - query_text
  - relevant_doc_ids (a list of doc_ids that are considered relevant for that query)
Part A — Build document embeddings
Compute embeddings for every row in docs_df and store them so they can be used for retrieval.
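A minimal sketch of Part A. The real embed_texts is provided by the notebook; here it is stubbed with random vectors (and docs_df with toy rows) so the example is self-contained:

```python
import numpy as np
import pandas as pd

# Stub for the notebook-provided utility; the real embed_texts would call a model.
def embed_texts(texts):
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 8)).astype(np.float32)

# Hypothetical toy corpus standing in for the real docs_df.
docs_df = pd.DataFrame({
    "doc_id": ["d1", "d2", "d3"],
    "text": ["alpha", "beta", "gamma"],
})

# One batched call embeds every document; row i of doc_embeddings
# corresponds to row i of docs_df, giving shape (n_docs, d).
doc_embeddings = embed_texts(docs_df["text"].tolist())
```

Keeping the embedding matrix row-aligned with docs_df avoids any per-document bookkeeping later: a row index into the similarity scores maps straight back to a doc_id.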
Part B — Retrieve top-k by cosine similarity
For each query:
- Compute the query embedding.
- Compute cosine similarity between the query embedding and all document embeddings.
- Return the top 10 documents ranked by similarity.
Your output per query should include:
- query_id
- a ranked list of the 10 retrieved doc_ids
- optionally the similarity scores
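One way to sketch the per-query retrieval step, with hypothetical embeddings standing in for the real ones. Both sides are L2-normalized so the dot product is a correct cosine similarity whether or not the embeddings were pre-normalized:

```python
import numpy as np

# Hypothetical data: 5 document embeddings and 1 query embedding in 4 dims.
rng = np.random.default_rng(1)
doc_embeddings = rng.normal(size=(5, 4))
doc_ids = np.array(["d0", "d1", "d2", "d3", "d4"])
query_emb = rng.normal(size=(4,))

def top_k_cosine(query_emb, doc_embeddings, doc_ids, k=10):
    # Normalize rows so the dot product equals cosine similarity.
    doc_norms = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    q_norm = query_emb / np.linalg.norm(query_emb)
    sims = doc_norms @ q_norm                # shape (n_docs,)
    k = min(k, len(sims))
    top = np.argsort(-sims)[:k]              # indices of the k largest similarities
    return doc_ids[top].tolist(), sims[top].tolist()

ranked_ids, scores = top_k_cosine(query_emb, doc_embeddings, doc_ids, k=10)
```

With fewer than 10 documents the sketch simply returns them all, still ranked; the doc_id list and score list stay index-aligned for the optional score output.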
Part C — Evaluate retrieval
Implement evaluation over all queries using:
- Recall@10: fraction of queries where at least one relevant document appears in the top 10.
- MRR@10: mean reciprocal rank of the first relevant document within the top 10 (0 if none).
Constraints / Notes
- Assume docs_df can be large enough that you should avoid unnecessary Python loops over documents.
- Embeddings may or may not be pre-normalized; handle cosine similarity correctly.
- You may use NumPy/Pandas only (no external vector databases).
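The no-loops constraint suggests scoring every query against every document with a single matrix product. A minimal NumPy-only sketch with hypothetical shapes, using argpartition so only each query's top 10 is fully sorted:

```python
import numpy as np

# Hypothetical embedding matrices (the real ones come from embed_texts).
rng = np.random.default_rng(2)
doc_embeddings = rng.normal(size=(1000, 16))   # (n_docs, d)
query_embeddings = rng.normal(size=(50, 16))   # (n_queries, d)

# L2-normalize rows, then one matmul yields the full cosine-similarity
# matrix of shape (n_queries, n_docs) -- no Python loop over documents.
D = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
Q = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True)
sims = Q @ D.T

# argpartition isolates each query's 10 largest scores in O(n_docs),
# then only those 10 are sorted into rank order.
k = 10
top_unsorted = np.argpartition(-sims, k, axis=1)[:, :k]
row = np.arange(sims.shape[0])[:, None]
order = np.argsort(-sims[row, top_unsorted], axis=1)
top_k = top_unsorted[row, order]               # (n_queries, k), ranked doc indices
```

For very large corpora this trades a full per-row sort (O(n log n)) for a partition plus a sort of 10 items, which is usually the right call at this scale.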
Describe your approach and implement the missing pieces (as if filling blanks in a notebook).