Find top-5 most similar rows across datasets
Company: Databricks
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: hard
Interview Round: Technical Screen
Quick Answer: This Coding & Algorithms problem in the Data Science domain evaluates similarity search and ranking skills, including selection and application of distance metrics (e.g., MSE, RMSE, cosine or Manhattan), handling of NULLs, feature scaling/weighting, and efficient top‑k retrieval across datasets.
Constraints
- 0 <= len(source) <= 200
- 0 <= len(target) <= 2000
- 1 <= number of feature columns <= 30
- Each feature value is numeric or None
- All rows in both datasets share the same feature columns
- No feature scaling/standardization is applied; all comparable features are equally weighted
- IDs within a dataset are of a single comparable type, so ties can be broken by ascending target_id
Examples
Input: ([{'source_id': 'S1', 'f1': 1, 'f2': 2}, {'source_id': 'S2', 'f1': 3, 'f2': 4}], [{'target_id': 'T1', 'f1': 1, 'f2': 2}, {'target_id': 'T2', 'f1': 2, 'f2': 2}, {'target_id': 'T3', 'f1': 0, 'f2': 2}, {'target_id': 'T4', 'f1': 3, 'f2': 5}, {'target_id': 'T5', 'f1': 10, 'f2': 10}, {'target_id': 'T6', 'f1': 3, 'f2': 4}])
Expected Output: [('S1', 'T1', 0.0, 1), ('S1', 'T2', 0.5, 2), ('S1', 'T3', 0.5, 3), ('S1', 'T6', 4.0, 4), ('S1', 'T4', 6.5, 5), ('S2', 'T6', 0.0, 1), ('S2', 'T4', 0.5, 2), ('S2', 'T2', 2.5, 3), ('S2', 'T1', 4.0, 4), ('S2', 'T3', 6.5, 5)]
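Explanation: Distances are computed with raw MSE over both features. For S1, T1 is an exact match (MSE 0.0); T2 and T3 tie at 0.5, so ascending target_id places T2 ahead of T3. Only the best 5 are kept per source row, so T5 (MSE 72.5 for S1, 42.5 for S2) never appears.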
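The distance used in these examples can be written out as follows. The notation is introduced here for illustration and is not from the original statement: s and t are a source and a target row, and C(s, t) is the set of feature columns where both rows hold a non-None value.

```latex
\mathrm{MSE}(s, t) = \frac{1}{|C(s, t)|} \sum_{f \in C(s, t)} (s_f - t_f)^2

% Worked check against the first example (both features comparable):
\mathrm{MSE}(S1, T4) = \frac{(1 - 3)^2 + (2 - 5)^2}{2} = \frac{4 + 9}{2} = 6.5
\qquad
\mathrm{MSE}(S1, T5) = \frac{(1 - 10)^2 + (2 - 10)^2}{2} = \frac{81 + 64}{2} = 72.5
```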
Input: ([{'source_id': 1, 'f1': 1, 'f2': None, 'f3': 5}], [{'target_id': 10, 'f1': 1, 'f2': 2, 'f3': None}, {'target_id': 11, 'f1': 2, 'f2': None, 'f3': 7}, {'target_id': 12, 'f1': None, 'f2': 2, 'f3': 5}, {'target_id': 13, 'f1': None, 'f2': None, 'f3': None}, {'target_id': 14, 'f1': 3, 'f2': 0, 'f3': 1}])
Expected Output: [(1, 10, 0.0, 1), (1, 12, 0.0, 2), (1, 11, 2.5, 3), (1, 14, 10.0, 4)]
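Explanation: Features with None on either side are ignored, and the MSE is averaged over the remaining comparable features only. Target 10 is comparable on f1 alone and target 12 on f3 alone; both score 0.0, and target_id breaks the tie. Target 13 has no comparable features at all, so it is skipped. Only 4 valid matches remain.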
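The None handling described above can be sketched as a small helper. This is a minimal sketch, not a reference solution: the name `masked_mse` and the convention of returning None when nothing is comparable are assumptions made for illustration.

```python
def masked_mse(source_row, target_row):
    """MSE over features where both rows hold a non-None value.

    Returns None when the two rows share no comparable feature
    (a convention assumed for this sketch).
    """
    squared_diffs = [
        (value - target_row[name]) ** 2
        for name, value in source_row.items()
        if name != "source_id"                 # skip the ID column
        and value is not None                  # source side must be present
        and target_row.get(name) is not None   # target side must be present
    ]
    if not squared_diffs:
        return None
    return sum(squared_diffs) / len(squared_diffs)


source_row = {"source_id": 1, "f1": 1, "f2": None, "f3": 5}
print(masked_mse(source_row, {"target_id": 11, "f1": 2, "f2": None, "f3": 7}))  # 2.5
print(masked_mse(source_row, {"target_id": 13, "f1": None, "f2": None, "f3": None}))  # None
```

Dividing by the number of comparable features, rather than by the total number of columns, is what lets target 10 and target 12 both reach 0.0 in the example.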
Input: ([], [{'target_id': 1, 'f1': 3, 'f2': 4}])
Expected Output: []
Explanation: If there are no source rows, there are no matches to return.
Input: ([{'source_id': 'A', 'f1': -1, 'f2': 2}], [{'target_id': 'X', 'f1': -1, 'f2': 2}, {'target_id': 'Y', 'f1': -2, 'f2': 2}, {'target_id': 'Z', 'f1': -1, 'f2': 0}])
Expected Output: [('A', 'X', 0.0, 1), ('A', 'Y', 0.5, 2), ('A', 'Z', 2.0, 3)]
Explanation: This case includes negative numbers and fewer than 5 targets. Return all valid matches ranked by increasing MSE.
Hints
- For each source row, compare it against every target row, but only over features where both values are not None.
- Once you compute all valid distances for one source row, sort by (distance, target_id) and keep only the first 5.
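The two hints can be combined into a brute-force sketch like the one below. The function name `top_k_matches` and the `k` parameter are hypothetical, and the sketch assumes the result is a flat list of (source_id, target_id, mse, rank) tuples in source input order, as the expected outputs suggest. It uses `heapq.nsmallest`, which returns the same result as sorting the (distance, target_id) pairs and slicing the first k.

```python
import heapq


def top_k_matches(source, target, k=5):
    """Return (source_id, target_id, mse, rank) tuples, best k per source row."""
    results = []
    for s in source:
        features = [name for name in s if name != "source_id"]
        candidates = []
        for t in target:
            # Squared differences over features present on both sides.
            squared_diffs = [
                (s[f] - t[f]) ** 2
                for f in features
                if s[f] is not None and t[f] is not None
            ]
            if not squared_diffs:
                continue  # no comparable feature: skip this target
            mse = sum(squared_diffs) / len(squared_diffs)
            candidates.append((mse, t["target_id"]))
        # Tuples compare by mse first, then by target_id, so ties
        # are broken by ascending target_id.
        best = heapq.nsmallest(k, candidates)
        for rank, (mse, target_id) in enumerate(best, start=1):
            results.append((s["source_id"], target_id, mse, rank))
    return results


source = [{"source_id": "A", "f1": -1, "f2": 2}]
target = [
    {"target_id": "X", "f1": -1, "f2": 2},
    {"target_id": "Y", "f1": -2, "f2": 2},
    {"target_id": "Z", "f1": -1, "f2": 0},
]
print(top_k_matches(source, target))
# [('A', 'X', 0.0, 1), ('A', 'Y', 0.5, 2), ('A', 'Z', 2.0, 3)]
```

At the stated limits (200 source rows, 2000 target rows, 30 features) this is at most 200 * 2000 * 30 = 12,000,000 feature comparisons, so the all-pairs scan should be adequate without an index structure.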