Find top-5 most similar rows across datasets
Company: Databricks
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: easy
Interview Round: Technical Screen
You can solve this in **SQL or Python**.
You are given two datasets with the same feature columns:
### Tables
**`target_rows`** (rows you want to match)
- `target_id` (STRING, PK)
- `f1, f2, ..., fk` (DOUBLE; numeric features; may contain NULLs)
**`candidate_rows`** (rows to search)
- `candidate_id` (STRING, PK)
- `f1, f2, ..., fk` (DOUBLE; numeric features; may contain NULLs)
### Task
For **each** row in `target_rows`, find the **top 5** most similar rows in `candidate_rows` using **all features**.
1. Define a similarity/distance metric using all features (e.g., **MSE** across features).
2. Compute the distance between each target row and candidate row.
3. Return the 5 candidates with the smallest distance per target.
### Distance definition (use this unless you clearly state an alternative)
Let the feature set be \(\{f_1,\dots,f_k\}\). Define distance as:
\[
\text{MSE}(t,c) = \frac{1}{k}\sum_{j=1}^{k} (t.f_j - c.f_j)^2
\]
Assumptions you should clarify in your solution:
- How you handle **NULL** features (e.g., drop those dimensions for that pair, or impute).
- Whether you need **feature scaling/standardization** before computing distances.
### Output
Return:
- `target_id`
- `candidate_id`
- `distance` (smaller = more similar)
- `rank` (1 to 5 per `target_id`, where 1 is most similar)
Order results by `target_id`, then `rank`.
Quick Answer: This question evaluates understanding and implementation of similarity/distance metrics (e.g., MSE), data preprocessing issues such as handling NULLs and feature scaling, and the ability to perform efficient cross-dataset comparisons and top-k ranking.