This question evaluates skills in similarity-based record linkage, weighted field scoring, and graph connectivity analysis within the Coding & Algorithms domain, examining competency in designing scalable matching strategies, thresholded similarity, and handling direct and indirect links between records.
You are given a list of user records. Each record has fields:
id (unique)
name
email
company
You are also given:
weights: a map from field name to weight (e.g., name: 0.2, email: 0.5, company: 0.3)
threshold: a float
target_user_id
Define similarity(recordA, recordB) as the sum over fields of:
weights[field] * field_similarity(field_valueA, field_valueB)
where field_similarity returns a value in [0,1] (the exact function is provided/assumed in the interview; for example, exact match => 1, otherwise 0; or a string similarity).
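As a minimal sketch of the definition above, the following assumes records are plain dictionaries and uses the exact-match option for field_similarity. The names exact_match and the field_similarity parameter are illustrative choices, not part of the problem statement; a string-similarity function could be passed in instead.

```python
from typing import Callable, Mapping


def exact_match(value_a: str, value_b: str) -> float:
    """Assumed field_similarity: 1.0 on an exact match, otherwise 0.0."""
    return 1.0 if value_a == value_b else 0.0


def similarity(
    record_a: Mapping[str, str],
    record_b: Mapping[str, str],
    weights: Mapping[str, float],
    field_similarity: Callable[[str, str], float] = exact_match,
) -> float:
    """Weighted sum of per-field similarities over the fields named in weights."""
    return sum(
        weight * field_similarity(record_a[field], record_b[field])
        for field, weight in weights.items()
    )


weights = {"name": 0.2, "email": 0.5, "company": 0.3}
a = {"id": "1", "name": "Ann Lee", "email": "ann@x.com", "company": "Acme"}
b = {"id": "2", "name": "A. Lee", "email": "ann@x.com", "company": "Acme"}
score = similarity(a, b, weights)  # email and company match, name does not
```

With these example weights, matching email and company contribute 0.5 + 0.3, while the differing name contributes nothing.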
Two records are considered linked if their total similarity score is >= threshold.
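One straightforward way to apply the linking rule is to compare every pair of records and store the links as an adjacency map. The sketch below is a brute-force illustration that makes O(n^2) comparisons; build_links and its score parameter are hypothetical names, and the scoring function is passed in so that any similarity definition can be used.

```python
from itertools import combinations
from typing import Callable, Dict, List, Mapping, Set

Record = Mapping[str, str]


def build_links(
    records: List[Record],
    score: Callable[[Record, Record], float],
    threshold: float,
) -> Dict[str, Set[str]]:
    """Map each record id to the ids of records it is directly linked to."""
    links: Dict[str, Set[str]] = {rec["id"]: set() for rec in records}
    for rec_a, rec_b in combinations(records, 2):
        # Two records are linked when their total score reaches the threshold.
        if score(rec_a, rec_b) >= threshold:
            links[rec_a["id"]].add(rec_b["id"])
            links[rec_b["id"]].add(rec_a["id"])
    return links


# Toy scoring function for the demo: only the email field counts.
def email_only(rec_a: Record, rec_b: Record) -> float:
    return 1.0 if rec_a["email"] == rec_b["email"] else 0.0


records = [
    {"id": "1", "email": "ann@x.com"},
    {"id": "2", "email": "ann@x.com"},
    {"id": "3", "email": "bob@y.com"},
]
links = build_links(records, email_only, threshold=0.5)
```

Here records 1 and 2 end up linked to each other, and record 3 has no links.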
Return all record IDs that should be considered the same user as target_user_id.
Include not only records directly linked to the target but also records linked to those direct matches (i.e., within 2 steps of the target), even if they are not themselves directly linked to the target.
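The two-step rule can be sketched over an adjacency map from record id to directly linked ids. The function name within_two_steps and the adjacency-map input are assumptions made for illustration, and including the target's own id in the result is also an assumption, since the statement does not say either way.

```python
from typing import Dict, Set


def within_two_steps(links: Dict[str, Set[str]], target_id: str) -> Set[str]:
    """Ids reachable from target_id in at most two link hops (target included)."""
    direct = set(links.get(target_id, set()))
    result = {target_id} | direct
    for neighbor in direct:
        # Second hop: everything linked to a direct match.
        result |= links.get(neighbor, set())
    return result


# Chain 1 - 2 - 3 - 4: record 4 is three hops from record 1.
chain = {"1": {"2"}, "2": {"1", "3"}, "3": {"2", "4"}, "4": {"3"}}
two_hop = within_two_steps(chain, "1")
```

In this chain, record 3 is included through record 2, while record 4 is left out because it is three hops away.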
Follow-up: generalize this to any number of steps. Return all record IDs in the entire connected component containing target_user_id, where an edge connects each pair of records whose similarity is >= threshold.
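A breadth-first search is one common way to collect a connected component. The sketch below assumes the links are already available as an adjacency map from record id to directly linked ids; connected_component is a hypothetical name, and a depth-first search or a union-find structure would work equally well.

```python
from collections import deque
from typing import Dict, Set


def connected_component(links: Dict[str, Set[str]], target_id: str) -> Set[str]:
    """All ids reachable from target_id through any number of links."""
    seen = {target_id}
    queue = deque([target_id])
    while queue:
        current = queue.popleft()
        for neighbor in links.get(current, set()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


# Chain 1 - 2 - 3 - 4 plus an isolated record 5.
graph = {"1": {"2"}, "2": {"1", "3"}, "3": {"2", "4"}, "4": {"3"}, "5": set()}
component = connected_component(graph, "1")
```

Each record is visited once and each link is examined at most twice, so the traversal itself is linear in the size of the link graph; the pairwise similarity comparisons needed to build that graph are usually the more expensive part.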