You are given a list of user records. Each record has the following fields:
id
(integer, unique)
name
(string)
email
(string)
company
(string)
You are also given a map weights from field name to a non-negative floating-point weight, for example:
Define the similarity score between two records A and B as:
{name, email, company}
:
A.field == B.field
(string equality), add
weights[field]
to the score.
0
.
Two records are said to be directly linked if their similarity score is greater than or equal to a given threshold (where 0 <= threshold <= 1). Treat the relation as undirected: if record i is directly linked to record j, then j is directly linked to i.
You are given:
rows
: a list of user records as described above.
weights
: the map from field name to weight.
threshold
: a floating-point number.
target_user_id
: the
id
of a record in
rows
.
Assume:
id
values are unique.
rows
fits in memory (you do not need to design a distributed system).
Example:
Under the definition above:
email
and
company
, so their similarity is
0.5 + 0.3 = 0.8 >= 0.5
→ directly linked.
name
, similarity is
0.2 < 0.5
→
not
directly linked.
Implement a function that, given rows, weights, threshold, and target_user_id, returns all record IDs whose records are directly linked to the target user's record (based solely on the similarity score definition).
You can choose any reasonable function signature, such as:
The result should not include target_user_id itself.
Now extend the requirement.
Construct an undirected graph where:
id
).
For a given target_user_id, you must now return all record IDs that are:
Example scenario:
In this case, for target_user_id = 1, your output should include {2, 3}: record 2 is a direct match; record 3 is a one-hop indirect match through 2.
Design and implement this extended query.
Further extend the requirement so that, for a given target_user_id, you return all record IDs in the same connected component as the target in the graph defined above.
In other words, return every record that is reachable from the target by any number of edges (0 or more), based on the direct-link relation.
Implement a function such as:
that returns the IDs of all records (excluding or including the target itself — specify your choice) that belong to the same connected component as the target.