What difficulty level is this coding question?

This is a hard difficulty Coding & Algorithms question, commonly asked during Technical Screen rounds at Databricks.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Databricks during technical interviews.

Find top-5 most similar rows across datasets

Quick Overview

This Coding & Algorithms problem in the Data Science domain evaluates similarity search and ranking skills, including selection and application of distance metrics (e.g., MSE, RMSE, cosine or Manhattan), handling of NULLs, feature scaling/weighting, and efficient top‑k retrieval across datasets.

Company: Databricks

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: hard

Interview Round: Technical Screen

You are given two datasets with the same feature columns:

- `source` (rows you want to match):
  - `source_id` (STRING/INT)
  - `f1...fk` (NUMERIC; may contain NULLs)
- `target` (candidate rows to search):
  - `target_id` (STRING/INT)
  - `f1...fk` (NUMERIC; may contain NULLs)

Task (choose Python or SQL): For each row in `source`, find the **top 5 rows in `target` that are most similar** when considering **all features**.

Requirements / clarifications:

1. Define a similarity (or distance) metric. If you use mean squared error (MSE) across features, specify it precisely:
   - how you treat NULLs,
   - whether you scale/standardize features,
   - whether you weight features equally.
2. Output columns: `source_id`, `target_id`, `distance` (smaller = more similar), `rank` (1–5 per `source_id`).
3. Be prepared to discuss tradeoffs of **MSE vs RMSE** (and any alternative metric you propose, e.g., cosine distance, Manhattan distance).
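To make the metric discussion concrete, here is a minimal sketch of the candidate metrics computed over comparable features only. The helper names (`comparable_pairs`, `mse`, `rmse`, `manhattan`, `cosine_distance`) are hypothetical, and two choices are assumptions rather than part of the problem statement: Manhattan distance is averaged over the comparable features so that pairs with different NULL patterns stay comparable, and cosine distance returns `None` when either vector has zero length.

```python
import math

def comparable_pairs(src, tgt, features):
    """Return (source_value, target_value) pairs where neither side is None."""
    return [(src[f], tgt[f]) for f in features
            if src[f] is not None and tgt[f] is not None]

# Each metric below expects a non-empty list of pairs;
# the caller should skip source-target pairs with nothing to compare.

def mse(pairs):
    return sum((s - t) ** 2 for s, t in pairs) / len(pairs)

def rmse(pairs):
    # Same ranking as MSE, but expressed in the features' original units.
    return math.sqrt(mse(pairs))

def manhattan(pairs):
    # Assumption: averaged over comparable features, mirroring how MSE is normalized.
    return sum(abs(s - t) for s, t in pairs) / len(pairs)

def cosine_distance(pairs):
    dot = sum(s * t for s, t in pairs)
    norm_s = math.sqrt(sum(s * s for s, _ in pairs))
    norm_t = math.sqrt(sum(t * t for _, t in pairs))
    if norm_s == 0 or norm_t == 0:
        return None  # Assumption: treat the undefined case as "no distance".
    return 1.0 - dot / (norm_s * norm_t)

pairs = comparable_pairs(
    {"f1": 1, "f2": 2, "f3": None},
    {"f1": 2, "f2": 4, "f3": 9},
    ["f1", "f2", "f3"],
)
print(mse(pairs), rmse(pairs), manhattan(pairs), cosine_distance(pairs))
```

In this example the target is exactly twice the source on the comparable features, so cosine distance is approximately zero while MSE is 2.5. That illustrates the main behavioral difference: cosine distance ignores magnitude and compares direction only, whereas MSE, RMSE, and Manhattan all penalize differences in scale.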

You are given two datasets with the same feature columns. Each row in `source` has a `source_id` and numeric features `f1...fk`, and each row in `target` has a `target_id` and the same feature columns.

Implement `solution(source, target)` to find the top 5 most similar target rows for every source row.

Use mean squared error (MSE) as the distance metric:

- Compare all feature columns except the ID column.
- Ignore any feature where either value is `None`.
- All remaining comparable features are weighted equally.
- Do not scale or standardize features.
- If a source-target pair has no comparable features, skip that pair entirely.

For a valid pair, if `C` is the set of comparable features, then:

`distance = sum((source[f] - target[f]) ** 2 for f in C) / len(C)`

Smaller distance means more similar. For each `source_id`, return up to 5 matches ordered by ascending distance; break ties by ascending `target_id`.

The final output should be a single list of tuples `(source_id, target_id, distance, rank)` in source-row order, with ranks starting at 1 for each source row.

Discussion note: be ready to explain why MSE and RMSE produce the same ranking for fixed pairs (square root is monotonic), why RMSE is easier to interpret in original units, and how alternatives like Manhattan or cosine distance change behavior.
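The distance rule and the MSE-versus-RMSE point can be checked with a small sketch. The helper name `pair_mse` is hypothetical, and the rows are assumed to be dictionaries keyed by column name, as in the examples on this page.

```python
import math

def pair_mse(src, tgt, features):
    """MSE over features that are non-None on both rows; None if nothing is comparable."""
    squared = [(src[f] - tgt[f]) ** 2 for f in features
               if src[f] is not None and tgt[f] is not None]
    return sum(squared) / len(squared) if squared else None

source_row = {"source_id": "S1", "f1": 1, "f2": 2}
targets = [
    {"target_id": "T4", "f1": 3, "f2": 5},
    {"target_id": "T2", "f1": 2, "f2": 2},
    {"target_id": "T6", "f1": 3, "f2": 4},
]
features = ["f1", "f2"]

# Rank the same candidates by MSE and by RMSE.
by_mse = sorted(targets, key=lambda t: pair_mse(source_row, t, features))
by_rmse = sorted(targets, key=lambda t: math.sqrt(pair_mse(source_row, t, features)))

print([t["target_id"] for t in by_mse])
print([t["target_id"] for t in by_rmse])
```

Both orderings come out as T2, T6, T4, because taking the square root of non-negative values never changes their relative order. The choice between MSE and RMSE therefore affects how the `distance` column reads, not which rows are returned.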

Constraints

0 <= len(source) <= 200
0 <= len(target) <= 2000
1 <= number of feature columns <= 30
Each feature value is numeric or None
All rows in both datasets share the same feature columns
No feature scaling/standardization is applied; all comparable features are equally weighted
IDs within a dataset are of a single comparable type, so ties can be broken by ascending target_id

Examples

Input: ([{'source_id': 'S1', 'f1': 1, 'f2': 2}, {'source_id': 'S2', 'f1': 3, 'f2': 4}], [{'target_id': 'T1', 'f1': 1, 'f2': 2}, {'target_id': 'T2', 'f1': 2, 'f2': 2}, {'target_id': 'T3', 'f1': 0, 'f2': 2}, {'target_id': 'T4', 'f1': 3, 'f2': 5}, {'target_id': 'T5', 'f1': 10, 'f2': 10}, {'target_id': 'T6', 'f1': 3, 'f2': 4}])

Expected Output: [('S1', 'T1', 0.0, 1), ('S1', 'T2', 0.5, 2), ('S1', 'T3', 0.5, 3), ('S1', 'T6', 4.0, 4), ('S1', 'T4', 6.5, 5), ('S2', 'T6', 0.0, 1), ('S2', 'T4', 0.5, 2), ('S2', 'T2', 2.5, 3), ('S2', 'T1', 4.0, 4), ('S2', 'T3', 6.5, 5)]

Explanation: Distances are computed with raw MSE over both features. For S1, T1 is an exact match; T2 and T3 tie at 0.5, so ascending target_id breaks the tie; T5 (distance 72.5) falls outside the best 5 and is dropped.

Input: ([{'source_id': 1, 'f1': 1, 'f2': None, 'f3': 5}], [{'target_id': 10, 'f1': 1, 'f2': 2, 'f3': None}, {'target_id': 11, 'f1': 2, 'f2': None, 'f3': 7}, {'target_id': 12, 'f1': None, 'f2': 2, 'f3': 5}, {'target_id': 13, 'f1': None, 'f2': None, 'f3': None}, {'target_id': 14, 'f1': 3, 'f2': 0, 'f3': 1}])

Expected Output: [(1, 10, 0.0, 1), (1, 12, 0.0, 2), (1, 11, 2.5, 3), (1, 14, 10.0, 4)]

Explanation: Features with None on either side are ignored. Target 13 has no comparable features at all, so it is skipped. Only 4 valid matches remain.

Input: ([], [{'target_id': 1, 'f1': 3, 'f2': 4}])

Expected Output: []

Explanation: If there are no source rows, there are no matches to return.

Input: ([{'source_id': 'A', 'f1': -1, 'f2': 2}], [{'target_id': 'X', 'f1': -1, 'f2': 2}, {'target_id': 'Y', 'f1': -2, 'f2': 2}, {'target_id': 'Z', 'f1': -1, 'f2': 0}])

Expected Output: [('A', 'X', 0.0, 1), ('A', 'Y', 0.5, 2), ('A', 'Z', 2.0, 3)]

Explanation: This case includes negative numbers and fewer than 5 targets. Return all valid matches ranked by increasing MSE.

Hints

For each source row, compare it against every target row, but only over features where both values are not None.
Once you compute all valid distances for one source row, sort by (distance, target_id) and keep only the first 5.
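Putting the two hints together, one possible brute-force implementation is sketched below. It is not presented as the official solution. It assumes rows are dictionaries as in the examples, derives the feature columns from each source row's keys, and exposes a `k` parameter (an addition for illustration; the problem fixes it at 5).

```python
import heapq

def solution(source, target, k=5):
    """Return (source_id, target_id, distance, rank) tuples, up to k per source row."""
    results = []
    for src in source:
        # Every column except the ID is treated as a feature.
        features = [col for col in src if col != "source_id"]
        candidates = []
        for tgt in target:
            squared = [(src[f] - tgt[f]) ** 2 for f in features
                       if src[f] is not None and tgt[f] is not None]
            if not squared:
                continue  # No comparable features: skip this pair entirely.
            candidates.append((sum(squared) / len(squared), tgt["target_id"]))
        # Tuples compare by distance first, then target_id, which gives the tie-break.
        best = heapq.nsmallest(k, candidates)
        for rank, (distance, target_id) in enumerate(best, start=1):
            results.append((src["source_id"], target_id, distance, rank))
    return results

source = [{"source_id": 1, "f1": 1, "f2": None, "f3": 5}]
target = [
    {"target_id": 10, "f1": 1, "f2": 2, "f3": None},
    {"target_id": 11, "f1": 2, "f2": None, "f3": 7},
    {"target_id": 12, "f1": None, "f2": 2, "f3": 5},
    {"target_id": 13, "f1": None, "f2": None, "f3": None},
    {"target_id": 14, "f1": 3, "f2": 0, "f3": 1},
]
print(solution(source, target))
```

`heapq.nsmallest(k, candidates)` returns the same result as `sorted(candidates)[:k]`. With at most 200 source rows, 2000 target rows, and 30 features, the worst case is about 12 million feature comparisons, so either form should be adequate at this scale; for much larger targets an approximate nearest-neighbor index would be the usual next step to discuss.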