PracHub
QuestionsPremiumLearningGuidesCheatsheetNEWCareers

Quick Overview

This question evaluates pairwise similarity matching, numerical feature comparison using mean squared error, null-value exclusion, and deterministic top‑k ranking—key competencies in data manipulation and feature-based record linkage.

  • hard
  • Databricks
  • Data Manipulation (SQL/Python)
  • Data Scientist

Find Top-5 Similar Rows

Company: Databricks

Role: Data Scientist

Category: Data Manipulation (SQL/Python)

Difficulty: hard

Interview Round: Technical Screen

You are given two point-in-time snapshot tables generated on the same day in UTC. There is no direct key relationship between the tables; each row in the first table must be compared with every row in the second table. Table `dataset_a` - `row_id` STRING PRIMARY KEY - `feature_1` DOUBLE - `feature_2` DOUBLE - `feature_3` DOUBLE - `feature_4` DOUBLE Table `dataset_b` - `row_id` STRING PRIMARY KEY - `feature_1` DOUBLE - `feature_2` DOUBLE - `feature_3` DOUBLE - `feature_4` DOUBLE For each row in `dataset_a`, find the 5 most similar rows in `dataset_b` using mean squared error (MSE) across the four numeric feature columns: `MSE = ((a.feature_1 - b.feature_1)^2 + (a.feature_2 - b.feature_2)^2 + (a.feature_3 - b.feature_3)^2 + (a.feature_4 - b.feature_4)^2) / 4` Lower MSE means higher similarity. Requirements: 1. Compare every row in `dataset_a` with every row in `dataset_b`. 2. Exclude any pair where at least one of the compared feature values is `NULL`. 3. For each `dataset_a.row_id`, rank candidate matches by ascending `MSE`. 4. Break ties by ascending `dataset_b.row_id`. 5. Return the top 5 matches per source row. Required output columns: - `source_row_id` STRING - `matched_row_id` STRING - `mse` DOUBLE - `similarity_rank` INT Sort the final output by `source_row_id`, then `similarity_rank`. You may solve this in SQL or pandas.

Quick Answer: This question evaluates pairwise similarity matching, numerical feature comparison using mean squared error, null-value exclusion, and deterministic top‑k ranking—key competencies in data manipulation and feature-based record linkage.

Last updated: May 7, 2026

Loading coding console...

PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • Careers
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Count weekly customers with YTD spend ≥ $1000 - Databricks (hard)
  • Find top-5 most similar rows across datasets - Databricks (easy)
  • Count weekly customers with ≥$1000 YTD spend - Databricks (easy)
  • Calculate Second-Degree Followers for Each YouTuber - Databricks (Medium)