Find Top-5 Similar Rows
Company: Databricks
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: hard
Interview Round: Technical Screen
You are given two point-in-time snapshot tables generated on the same day in UTC. There is no direct key relationship between the tables; each row in the first table must be compared with every row in the second table.
Table `dataset_a`
- `row_id` STRING PRIMARY KEY
- `feature_1` DOUBLE
- `feature_2` DOUBLE
- `feature_3` DOUBLE
- `feature_4` DOUBLE
Table `dataset_b`
- `row_id` STRING PRIMARY KEY
- `feature_1` DOUBLE
- `feature_2` DOUBLE
- `feature_3` DOUBLE
- `feature_4` DOUBLE
For each row in `dataset_a`, find the 5 most similar rows in `dataset_b` using mean squared error (MSE) across the four numeric feature columns:
`MSE = ((a.feature_1 - b.feature_1)^2 + (a.feature_2 - b.feature_2)^2 + (a.feature_3 - b.feature_3)^2 + (a.feature_4 - b.feature_4)^2) / 4`
Lower MSE means higher similarity.
Requirements:
1. Compare every row in `dataset_a` with every row in `dataset_b`.
2. Exclude any pair where at least one of the compared feature values is `NULL`.
3. For each `dataset_a.row_id`, rank candidate matches by ascending `MSE`.
4. Break ties by ascending `dataset_b.row_id`.
5. Return the top 5 matches per source row.
Required output columns:
- `source_row_id` STRING
- `matched_row_id` STRING
- `mse` DOUBLE
- `similarity_rank` INT
Sort the final output by `source_row_id`, then `similarity_rank`.
You may solve this in SQL or pandas.
Quick Answer: This question evaluates pairwise similarity matching, numerical feature comparison using mean squared error, null-value exclusion, and deterministic top‑k ranking—key competencies in data manipulation and feature-based record linkage.