PracHub

Quick Overview

This Coding & Algorithms problem in the Data Science domain evaluates similarity search and ranking skills, including selection and application of distance metrics (e.g., MSE, RMSE, cosine or Manhattan), handling of NULLs, feature scaling/weighting, and efficient top‑k retrieval across datasets.

Find top-5 most similar rows across datasets

Company: Databricks

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: hard

Interview Round: Technical Screen

You are given two datasets with the same feature columns:

  • `source` (rows you want to match): `source_id` (STRING/INT) and `f1...fk` (NUMERIC; may contain NULLs)
  • `target` (candidate rows to search): `target_id` (STRING/INT) and `f1...fk` (NUMERIC; may contain NULLs)

Task (choose Python or SQL): for each row in `source`, find the **top 5 rows in `target` that are most similar** when considering **all features**.

Requirements / clarifications:

  1. Define a similarity (or distance) metric. If you use mean squared error (MSE) across features, specify it precisely: how you treat NULLs, whether you scale/standardize features, and whether you weight features equally.
  2. Output columns: `source_id`, `target_id`, `distance` (smaller = more similar), and `rank` (1–5 per `source_id`).
  3. Be prepared to discuss the tradeoffs of **MSE vs RMSE** (and any alternative metric you propose, e.g., cosine distance or Manhattan distance).

You are given two datasets with the same feature columns. Each row in `source` has a `source_id` and numeric features `f1...fk`, and each row in `target` has a `target_id` and the same feature columns. Implement `solution(source, target)` to find the top 5 most similar target rows for every source row.

Use mean squared error (MSE) as the distance metric:

  • Compare all feature columns except the ID column.
  • Ignore any feature where either value is `None`.
  • Weight all remaining comparable features equally.
  • Do not scale or standardize features.
  • If a source-target pair has no comparable features, skip that pair entirely.

For a valid pair, if `C` is the set of comparable features, then:

`distance = sum((source[f] - target[f]) ** 2 for f in C) / len(C)`

Smaller distance means more similar. For each `source_id`, return up to 5 matches ordered by ascending distance; break ties by ascending `target_id`. The final output is a single list of tuples `(source_id, target_id, distance, rank)` in source-row order, with ranks starting at 1 for each source row.

Discussion note: be ready to explain why MSE and RMSE produce the same ranking (RMSE is the square root of MSE, and the square root is monotonic), why RMSE is easier to interpret because it is in the original feature units, and how alternatives such as Manhattan or cosine distance change the behavior.
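The rules above admit a direct brute-force pass. A minimal Python sketch, assuming rows arrive as plain dicts shaped like the examples below:

```python
def solution(source, target):
    """Top-5 most similar target rows per source row, ranked by raw MSE."""
    results = []
    for s in source:
        candidates = []
        for t in target:
            # Comparable features: non-None on both sides, IDs excluded.
            comparable = [
                f for f in s
                if f != 'source_id'
                and s[f] is not None
                and t.get(f) is not None
            ]
            if not comparable:
                continue  # no overlapping features -> skip this pair entirely
            mse = sum((s[f] - t[f]) ** 2 for f in comparable) / len(comparable)
            candidates.append((mse, t['target_id']))
        # Ascending distance, ties broken by ascending target_id; keep best 5.
        candidates.sort()
        for rank, (dist, tid) in enumerate(candidates[:5], start=1):
            results.append((s['source_id'], tid, dist, rank))
    return results
```

Under the stated constraints (at most 200 source rows, 2,000 target rows, and 30 features) this is roughly 12 million feature comparisons at worst, so the quadratic scan needs no indexing; `heapq.nsmallest(5, candidates)` would avoid the full sort if the candidate list were much larger.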

Constraints

  • 0 <= len(source) <= 200
  • 0 <= len(target) <= 2000
  • 1 <= number of feature columns <= 30
  • Each feature value is numeric or None
  • All rows in both datasets share the same feature columns
  • No feature scaling/standardization is applied; all comparable features are equally weighted
  • IDs within a dataset are of a single comparable type, so ties can be broken by ascending target_id

Examples

Input: ([{'source_id': 'S1', 'f1': 1, 'f2': 2}, {'source_id': 'S2', 'f1': 3, 'f2': 4}], [{'target_id': 'T1', 'f1': 1, 'f2': 2}, {'target_id': 'T2', 'f1': 2, 'f2': 2}, {'target_id': 'T3', 'f1': 0, 'f2': 2}, {'target_id': 'T4', 'f1': 3, 'f2': 5}, {'target_id': 'T5', 'f1': 10, 'f2': 10}, {'target_id': 'T6', 'f1': 3, 'f2': 4}])

Expected Output: [('S1', 'T1', 0.0, 1), ('S1', 'T2', 0.5, 2), ('S1', 'T3', 0.5, 3), ('S1', 'T6', 4.0, 4), ('S1', 'T4', 6.5, 5), ('S2', 'T6', 0.0, 1), ('S2', 'T4', 0.5, 2), ('S2', 'T2', 2.5, 3), ('S2', 'T1', 4.0, 4), ('S2', 'T3', 6.5, 5)]

Explanation: Distances are computed with raw MSE over both features. For S1, T1 is exact, T2 and T3 tie at 0.5 so target_id breaks the tie, and only the best 5 are kept.
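As a spot check of the arithmetic, the S1-to-T4 distance in this example works out to:

```python
# S1 = (1, 2), T4 = (3, 5); raw MSE over both features.
distance = ((1 - 3) ** 2 + (2 - 5) ** 2) / 2
print(distance)  # (4 + 9) / 2 = 6.5, matching ('S1', 'T4', 6.5, 5)
```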

Input: ([{'source_id': 1, 'f1': 1, 'f2': None, 'f3': 5}], [{'target_id': 10, 'f1': 1, 'f2': 2, 'f3': None}, {'target_id': 11, 'f1': 2, 'f2': None, 'f3': 7}, {'target_id': 12, 'f1': None, 'f2': 2, 'f3': 5}, {'target_id': 13, 'f1': None, 'f2': None, 'f3': None}, {'target_id': 14, 'f1': 3, 'f2': 0, 'f3': 1}])

Expected Output: [(1, 10, 0.0, 1), (1, 12, 0.0, 2), (1, 11, 2.5, 3), (1, 14, 10.0, 4)]

Explanation: Features with None on either side are ignored. Target 13 has no comparable features at all, so it is skipped. Only 4 valid matches remain.
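The None-masking rule can be seen in isolation on a single pair; `pair_mse` here is a hypothetical helper, not part of the required interface:

```python
def pair_mse(src, tgt, features):
    """Raw MSE over features that are non-None on both sides; None if no overlap."""
    comparable = [
        f for f in features
        if src.get(f) is not None and tgt.get(f) is not None
    ]
    if not comparable:
        return None  # the pair is skipped entirely
    return sum((src[f] - tgt[f]) ** 2 for f in comparable) / len(comparable)

src = {'f1': 1, 'f2': None, 'f3': 5}
print(pair_mse(src, {'f1': 2, 'f2': None, 'f3': 7}, ['f1', 'f2', 'f3']))   # (1 + 4) / 2 = 2.5
print(pair_mse(src, {'f1': None, 'f2': None, 'f3': None}, ['f1', 'f2', 'f3']))  # None
```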

Input: ([], [{'target_id': 1, 'f1': 3, 'f2': 4}])

Expected Output: []

Explanation: If there are no source rows, there are no matches to return.

Input: ([{'source_id': 'A', 'f1': -1, 'f2': 2}], [{'target_id': 'X', 'f1': -1, 'f2': 2}, {'target_id': 'Y', 'f1': -2, 'f2': 2}, {'target_id': 'Z', 'f1': -1, 'f2': 0}])

Expected Output: [('A', 'X', 0.0, 1), ('A', 'Y', 0.5, 2), ('A', 'Z', 2.0, 3)]

Explanation: This case includes negative numbers and fewer than 5 targets. Return all valid matches ranked by increasing MSE.

Hints

  1. For each source row, compare it against every target row, but only over features where both values are not None.
  2. Once you compute all valid distances for one source row, sort by (distance, target_id) and keep only the first 5.
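For the MSE-vs-RMSE discussion point: RMSE is the square root of MSE, and the square root is strictly increasing, so ranking candidates by either metric yields the same order. A quick check:

```python
import math

mses = [6.5, 0.0, 0.5, 4.0, 72.5]
by_mse = sorted(range(len(mses)), key=lambda i: mses[i])
by_rmse = sorted(range(len(mses)), key=lambda i: math.sqrt(mses[i]))
assert by_mse == by_rmse  # identical orderings; only the reported values differ
```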

Last updated: Apr 19, 2026

