This question evaluates understanding and implementation of similarity/distance metrics (e.g., MSE), data preprocessing issues such as handling NULLs and feature scaling, and the ability to perform efficient cross-dataset comparisons and top-k ranking.
You can solve this in SQL or Python.
You are given two datasets with the same feature columns:
target_rows (rows you want to match)
- target_id (STRING, PK)
- f1, f2, ..., fk (DOUBLE; numeric features; may contain NULLs)
candidate_rows (rows to search)
- candidate_id (STRING, PK)
- f1, f2, ..., fk (DOUBLE; numeric features; may contain NULLs)
For each row in target_rows, find the top 5 most similar rows in candidate_rows using all features.
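Let the feature set be $F = \{f_1, f_2, \dots, f_k\}$. Define distance as the mean squared error (MSE) between a target row $t$ and a candidate row $c$:
$$\text{distance}(t, c) = \frac{1}{k} \sum_{i=1}^{k} \left(t_{f_i} - c_{f_i}\right)^2$$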
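Assumptions you should clarify in your solution: how NULL feature values are handled, and whether features are scaled before the distance is computed.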
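As one possible reading, the sketch below assumes the distance is the mean squared error and that a feature is skipped whenever either side is NULL (represented here as `None`), with the mean taken over the remaining features. The function name `mse_distance` and the skip-on-NULL policy are illustrative assumptions, not part of the problem statement; imputing a default value instead would be an equally valid choice to state.

```python
from typing import Optional, Sequence


def mse_distance(
    target: Sequence[Optional[float]],
    candidate: Sequence[Optional[float]],
) -> Optional[float]:
    """Mean squared error over features that are non-NULL on both sides.

    Assumption (not stated in the problem): a feature is skipped when either
    value is None, and the mean is taken over the features that remain.
    Returns None when no feature can be compared.
    """
    if len(target) != len(candidate):
        raise ValueError("rows must have the same number of features")
    squared = [
        (t - c) ** 2
        for t, c in zip(target, candidate)
        if t is not None and c is not None
    ]
    if not squared:
        return None
    return sum(squared) / len(squared)


# The third feature is NULL on the target side, so only two features count.
print(mse_distance([1.0, 2.0, None], [1.0, 4.0, 3.0]))  # 2.0
```

Dividing by the number of comparable features, rather than by k, keeps rows with many NULLs from looking artificially close. If features have very different ranges, scaling them first (for example, to zero mean and unit variance) would stop one feature from dominating the sum.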
Return:
- target_id
- candidate_id
- distance (smaller = more similar)
- rank (1 to 5 per target_id, where 1 is most similar)
Order results by target_id, then rank.
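A minimal Python sketch of the full task follows, under several assumptions that the problem leaves open: distance is the MSE over features that are non-NULL on both sides, candidates with no comparable feature are dropped, ties are broken by `candidate_id`, and features are already on comparable scales. The names `mse` and `top_k_matches` and the dict-of-lists input shape are hypothetical.

```python
import heapq
from typing import Dict, List, Optional, Sequence, Tuple

Row = Sequence[Optional[float]]


def mse(t: Row, c: Row) -> Optional[float]:
    """MSE over features present on both sides; None if nothing is comparable."""
    sq = [(a - b) ** 2 for a, b in zip(t, c) if a is not None and b is not None]
    return sum(sq) / len(sq) if sq else None


def top_k_matches(
    targets: Dict[str, Row],
    candidates: Dict[str, Row],
    k: int = 5,
) -> List[Tuple[str, str, float, int]]:
    """Return (target_id, candidate_id, distance, rank), ordered by target_id, rank."""
    results: List[Tuple[str, str, float, int]] = []
    for target_id in sorted(targets):
        scored = []
        for candidate_id, features in candidates.items():
            d = mse(targets[target_id], features)
            if d is not None:  # assumption: skip candidates that cannot be compared
                scored.append((d, candidate_id))
        # Tuples compare by distance first, then candidate_id, so ties are
        # broken deterministically. nsmallest returns them in ascending order.
        best = heapq.nsmallest(k, scored)
        for rank, (d, candidate_id) in enumerate(best, start=1):
            results.append((target_id, candidate_id, d, rank))
    return results


targets = {"t1": [0.0, 0.0]}
candidates = {"c1": [1.0, 1.0], "c2": [0.0, 0.0], "c3": [None, 2.0]}
for row in top_k_matches(targets, candidates, k=2):
    print(row)
# ('t1', 'c2', 0.0, 1)
# ('t1', 'c1', 1.0, 2)
```

This compares every target with every candidate, so the cost grows with the product of the two table sizes. `heapq.nsmallest` avoids sorting every candidate for each target. The equivalent SQL shape would be a cross join followed by `ROW_NUMBER()` partitioned by `target_id` and ordered by distance.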