Find top-5 most similar rows across datasets

Q: Find top-5 most similar rows across datasets

This Coding & Algorithms problem in the Data Science domain evaluates similarity search and ranking skills, including selection and application of distance metrics (e.g., MSE, RMSE, cosine or Manhattan), handling of NULLs, feature scaling/weighting, and efficient top‑k retrieval across datasets.

Question

You are given two datasets with the same feature columns:

source (rows you want to match):
- source_id (STRING/INT)
- f1...fk (NUMERIC; may contain NULLs)

target (candidate rows to search):
- target_id (STRING/INT)
- f1...fk (NUMERIC; may contain NULLs)

Task (choose Python or SQL): For each row in source, find the top 5 rows in target that are most similar when considering all features.

Requirements / clarifications:

Define a similarity (or distance) metric. If you use mean squared error (MSE) across features, specify it precisely:
- how you treat NULLs,
- whether you scale/standardize features,
- whether you weight features equally.

Output columns:
- source_id, target_id, distance (smaller = more similar), rank (1–5 per source_id).

Be prepared to discuss tradeoffs of MSE vs RMSE (and any alternative metric you propose, e.g., cosine distance, Manhattan distance).
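
One possible way to satisfy the requirements above in Python is sketched below. It is an illustration under stated assumptions, not a reference solution: features are z-scored using statistics pooled over both tables, all features are weighted equally, and a NULL in either row drops that feature from the pair's mean, so the MSE is averaged over shared features only. The function name top_k_matches and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd

def top_k_matches(source, target, features, k=5):
    """Return the k nearest target rows per source row under a masked MSE."""
    # Assumption: standardize with statistics pooled over both tables
    # (pandas skips NaN when computing mean and std).
    pooled = pd.concat([source[features], target[features]])
    mu = pooled.mean()
    sd = pooled.std(ddof=0).replace(0, 1.0)  # constant columns: avoid dividing by zero
    S = ((source[features] - mu) / sd).to_numpy(dtype=float)
    T = ((target[features] - mu) / sd).to_numpy(dtype=float)

    # Pairwise squared differences, shape (n_source, n_target, n_features).
    diff2 = (S[:, None, :] - T[None, :, :]) ** 2
    valid = ~np.isnan(diff2)                 # feature is present in both rows
    n_valid = valid.sum(axis=2)
    sq_sum = np.where(valid, diff2, 0.0).sum(axis=2)
    # Mean over shared features only; pairs with no shared feature get +inf,
    # so they sort last (they can still appear if fewer than k finite pairs exist).
    mse = np.where(n_valid > 0, sq_sum / np.maximum(n_valid, 1), np.inf)

    k = min(k, mse.shape[1])
    # Stable sort so ties keep the original target order.
    order = np.argsort(mse, axis=1, kind="stable")[:, :k]
    rows = np.arange(mse.shape[0])[:, None]
    return pd.DataFrame({
        "source_id": np.repeat(source["source_id"].to_numpy(), k),
        "target_id": target["target_id"].to_numpy()[order].ravel(),
        "distance": mse[rows, order].ravel(),
        "rank": np.tile(np.arange(1, k + 1), mse.shape[0]),
    })

# Hypothetical sample data, including NULLs (NaN) on both sides.
source = pd.DataFrame({"source_id": ["s1", "s2"],
                       "f1": [1.0, 10.0],
                       "f2": [2.0, np.nan]})
target = pd.DataFrame({"target_id": ["t1", "t2", "t3", "t4", "t5", "t6"],
                       "f1": [1.0, 2.0, 9.0, 10.0, np.nan, 5.0],
                       "f2": [2.0, 2.5, 8.0, np.nan, 2.0, 5.0]})
result = top_k_matches(source, target, ["f1", "f2"])
print(result)
```

On the MSE vs RMSE point: RMSE is the square root of MSE, which is a monotone transform, so both produce the same top-5 ordering and differ only in the scale of the reported distance (RMSE is in the units of the standardized features). Averaging over shared features only can favor pairs that overlap on few features, so a minimum-overlap threshold may be worth discussing. The sketch materializes an n_source × n_target × k array, so for large tables it would need chunking over source rows or an index-based nearest-neighbor search instead.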
