Find top-5 most similar rows across datasets

Q: Find top-5 most similar rows across datasets

This question evaluates understanding and implementation of similarity/distance metrics (e.g., MSE), data preprocessing issues such as handling NULLs and feature scaling, and the ability to perform efficient cross-dataset comparisons and top-k ranking.

Q: How do I approach Data Manipulation (SQL/Python) interview questions?

Data Manipulation (SQL/Python) questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master data manipulation (sql/python) interviews.

Question

You can solve this in SQL or Python.

You are given two datasets with the same feature columns:

Tables

target_rows (rows you want to match)

target_id (STRING, PK)
f1, f2, ..., fk (DOUBLE; numeric features; may contain NULLs)

candidate_rows (rows to search)

candidate_id (STRING, PK)
f1, f2, ..., fk (DOUBLE; numeric features; may contain NULLs)

Task

For each row in target_rows, find the top 5 most similar rows in candidate_rows using all features.

Define a similarity/distance metric using all features (e.g., MSE across features).
Compute the distance between each target row and candidate row.
Return the 5 candidates with the smallest distance per target.

Distance definition (use this unless you clearly state an alternative)

Let the feature set be $\{f_1,\dots,f_k\}$ . Define distance as:

$\text{MSE}(t,c) = \frac{1}{k}\sum_{j=1}^{k} (t.f_j - c.f_j)^2$

Assumptions you should clarify in your solution:

How you handle NULL features (e.g., drop those dimensions for that pair, or impute).
Whether you need feature scaling/standardization before computing distances.

Output

Return:

target_id
candidate_id
distance (smaller = more similar)
rank (1 to 5 per target_id , where 1 is most similar)

Order results by target_id, then rank.

Find top-5 most similar rows across datasets

Overview

Tables

Task

Distance definition (use this unless you clearly state an alternative)

Output

Comments (0)