You are given a list of user records. Each record has fields:
- id (unique)
- name
- email
- company
You are also given:
- weights: a map from field name to weight (e.g., name: 0.2, email: 0.5, company: 0.3)
- threshold: a float
- target_user_id
Similarity scoring
Define similarity(recordA, recordB) as the sum, over every field in weights, of:
- weights[field] * field_similarity(field_valueA, field_valueB)
where field_similarity returns a value in [0, 1]. The exact function is provided or assumed in the interview: for example, 1 for an exact match and 0 otherwise, or a graded string-similarity score.
Two records are considered linked if their total similarity score is >= threshold.
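As a rough illustration of the scoring rule, here is a minimal Python sketch. It assumes the exact-match variant of field_similarity, made case-insensitive for convenience; the helper names exact_match and is_linked are hypothetical and not part of the problem statement.

```python
from typing import Callable, Dict

Record = Dict[str, str]


def exact_match(a: str, b: str) -> float:
    """Assumed field_similarity: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def similarity(
    rec_a: Record,
    rec_b: Record,
    weights: Dict[str, float],
    field_similarity: Callable[[str, str], float] = exact_match,
) -> float:
    """Weighted sum of per-field similarities over the fields listed in weights."""
    return sum(
        weight * field_similarity(rec_a[field], rec_b[field])
        for field, weight in weights.items()
    )


def is_linked(
    rec_a: Record,
    rec_b: Record,
    weights: Dict[str, float],
    threshold: float,
    field_similarity: Callable[[str, str], float] = exact_match,
) -> bool:
    """Two records are linked when their total score reaches the threshold."""
    return similarity(rec_a, rec_b, weights, field_similarity) >= threshold
```

Passing field_similarity as a parameter keeps the scoring rule separate from the comparison function, so a string-similarity measure could be swapped in without touching the rest.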
Task
Return all record IDs that should be considered the same user as target_user_id.
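One possible baseline for the main task is a single pass that compares the target against every other record. The sketch below assumes exact-match field similarity and string IDs; the function name direct_matches and the include_target flag are illustrative choices, not given by the problem.

```python
from typing import Dict, List

Record = Dict[str, str]


def similarity(a: Record, b: Record, weights: Dict[str, float]) -> float:
    # Assumption: field_similarity is exact match (1 if equal, else 0).
    return sum(w for field, w in weights.items() if a[field] == b[field])


def direct_matches(
    records: List[Record],
    weights: Dict[str, float],
    threshold: float,
    target_user_id: str,
    include_target: bool = False,
) -> List[str]:
    """Return IDs of records directly linked to the target, in input order."""
    by_id = {r["id"]: r for r in records}
    target = by_id[target_user_id]  # raises KeyError if the target is missing
    matches = [
        r["id"]
        for r in records
        if r["id"] != target_user_id
        and similarity(target, r, weights) >= threshold
    ]
    return ([target_user_id] if include_target else []) + matches
```

This performs n - 1 comparisons for n records, which is the minimum when no index over field values is available.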
Follow-up 1: include 1-hop indirect links
Include not only records directly linked to the target but also records linked to those direct matches (that is, every record within 2 steps of the target), even if they are not themselves directly linked to the target.
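Follow-up 1 can be approached as a breadth-first search that stops expanding once a record is 2 steps from the target. The following is a sketch under the same exact-match assumption; within_hops and max_hops are hypothetical names, and the target itself is excluded from the result here.

```python
from collections import deque
from typing import Dict, List

Record = Dict[str, str]


def similarity(a: Record, b: Record, weights: Dict[str, float]) -> float:
    # Assumption: field_similarity is exact match (1 if equal, else 0).
    return sum(w for field, w in weights.items() if a[field] == b[field])


def within_hops(
    records: List[Record],
    weights: Dict[str, float],
    threshold: float,
    target_user_id: str,
    max_hops: int = 2,
) -> List[str]:
    """Return sorted IDs reachable from the target in at most max_hops links."""
    by_id = {r["id"]: r for r in records}
    dist = {target_user_id: 0}  # hop count from the target
    queue = deque([target_user_id])
    while queue:
        current_id = queue.popleft()
        if dist[current_id] == max_hops:
            continue  # do not expand past the hop limit
        current = by_id[current_id]
        for r in records:
            rid = r["id"]
            if rid not in dist and similarity(current, r, weights) >= threshold:
                dist[rid] = dist[current_id] + 1
                queue.append(rid)
    return sorted(rid for rid in dist if rid != target_user_id)
```

With max_hops=1 this reduces to the direct-match case, so the same function covers both the base task and this follow-up.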
Follow-up 2: include all indirect links (connected component)
Return all record IDs in the entire connected component containing target_user_id, where edges connect pairs of records whose similarity is >= threshold.
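For Follow-up 2, one option is a graph traversal with no depth limit, computing edges lazily instead of building the full graph first. This sketch again assumes exact-match field similarity; connected_component is a hypothetical name, and the returned set includes the target ID, which is one of the choices worth confirming with the interviewer.

```python
from typing import Dict, List, Set

Record = Dict[str, str]


def similarity(a: Record, b: Record, weights: Dict[str, float]) -> float:
    # Assumption: field_similarity is exact match (1 if equal, else 0).
    return sum(w for field, w in weights.items() if a[field] == b[field])


def connected_component(
    records: List[Record],
    weights: Dict[str, float],
    threshold: float,
    target_user_id: str,
) -> Set[str]:
    """Return every ID in the target's connected component (target included)."""
    by_id = {r["id"]: r for r in records}
    unvisited = set(by_id) - {target_user_id}
    component = {target_user_id}
    stack = [target_user_id]
    while stack:
        current = by_id[stack.pop()]
        # Only compare against records not yet assigned to the component.
        linked = {
            rid
            for rid in unvisited
            if similarity(current, by_id[rid], weights) >= threshold
        }
        unvisited -= linked
        component |= linked
        stack.extend(linked)
    return component
```

Shrinking the unvisited set as records are claimed avoids re-scoring pairs already known to be in the component, though the worst case remains quadratic in the number of records. A union-find structure would be a reasonable alternative if components for many targets are needed.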
Notes
- Clarify whether the output includes the target ID itself.
- Aim for an approach that avoids unnecessary pairwise comparisons when possible (discuss indexing/blocking if relevant).
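One way to illustrate blocking is an inverted index from (field, value) pairs to record IDs. The sketch assumes exact-match field similarity and a threshold above 0: under those assumptions, two records that share no field value score 0 and can be skipped safely. With a fuzzy field_similarity this shortcut no longer holds and a different blocking key (for example, a normalized email domain) would be needed. The names build_index and candidate_ids are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, Iterable, List, Set, Tuple

Record = Dict[str, str]
Index = DefaultDict[Tuple[str, str], Set[str]]


def build_index(records: List[Record], fields: Iterable[str]) -> Index:
    """Map each (field, value) pair to the IDs of records holding that value."""
    index: Index = defaultdict(set)
    for r in records:
        for field in fields:
            index[(field, r[field])].add(r["id"])
    return index


def candidate_ids(index: Index, record: Record, fields: Iterable[str]) -> Set[str]:
    """IDs sharing at least one field value with record (excluding itself)."""
    candidates: Set[str] = set()
    for field in fields:
        # .get avoids inserting empty entries into the defaultdict.
        candidates |= index.get((field, record[field]), set())
    candidates.discard(record["id"])
    return candidates
```

Only the returned candidates then need a full similarity score, which can cut comparisons substantially when field values are mostly distinct. Very common values, such as a large company name, still produce big blocks.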