PracHub
QuestionsPremiumLearningGuidesCheatsheetNEWCoaches

Quick Overview

This question evaluates understanding and implementation of similarity/distance metrics (e.g., MSE), data preprocessing issues such as handling NULLs and feature scaling, and the ability to perform efficient cross-dataset comparisons and top-k ranking.

  • easy
  • Databricks
  • Data Manipulation (SQL/Python)
  • Data Scientist

Find top-5 most similar rows across datasets

Company: Databricks

Role: Data Scientist

Category: Data Manipulation (SQL/Python)

Difficulty: easy

Interview Round: Technical Screen

You can solve this in **SQL or Python**. You are given two datasets with the same feature columns: ### Tables **`target_rows`** (rows you want to match) - `target_id` (STRING, PK) - `f1, f2, ..., fk` (DOUBLE; numeric features; may contain NULLs) **`candidate_rows`** (rows to search) - `candidate_id` (STRING, PK) - `f1, f2, ..., fk` (DOUBLE; numeric features; may contain NULLs) ### Task For **each** row in `target_rows`, find the **top 5** most similar rows in `candidate_rows` using **all features**. 1. Define a similarity/distance metric using all features (e.g., **MSE** across features). 2. Compute the distance between each target row and candidate row. 3. Return the 5 candidates with the smallest distance per target. ### Distance definition (use this unless you clearly state an alternative) Let the feature set be \(\{f_1,\dots,f_k\}\). Define distance as: \[ \text{MSE}(t,c) = \frac{1}{k}\sum_{j=1}^{k} (t.f_j - c.f_j)^2 \] Assumptions you should clarify in your solution: - How you handle **NULL** features (e.g., drop those dimensions for that pair, or impute). - Whether you need **feature scaling/standardization** before computing distances. ### Output Return: - `target_id` - `candidate_id` - `distance` (smaller = more similar) - `rank` (1 to 5 per `target_id`, where 1 is most similar) Order results by `target_id`, then `rank`.

Quick Answer: This question evaluates understanding and implementation of similarity/distance metrics (e.g., MSE), data preprocessing issues such as handling NULLs and feature scaling, and the ability to perform efficient cross-dataset comparisons and top-k ranking.

Last updated: Mar 29, 2026

Loading coding console...

PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Find Top-5 Similar Rows - Databricks (hard)
  • Count weekly customers with YTD spend ≥ $1000 - Databricks (hard)
  • Count weekly customers with ≥$1000 YTD spend - Databricks (easy)
  • Calculate Second-Degree Followers for Each YouTuber - Databricks (Medium)