Find linked user records by similarity
Company: Stripe
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Technical Screen
Quick Answer: This question evaluates the ability to compute weighted field-based similarity and reason about adjacency relationships in an undirected graph, testing competencies in similarity scoring, record linkage, and basic graph traversal.
Find linked user records by similarity — Part 1: Direct matches
Constraints
- id values are unique integers.
- 0 <= threshold <= 1.
- weights are non-negative floats over the fields {name, email, company}.
- rows fits in memory.
- target_user_id is the id of a record present in rows.
Examples
Input: ([{'id': 1, 'name': 'Alice', 'email': 'alice@gmail.com', 'company': 'Stripe'}, {'id': 2, 'name': 'Alicia', 'email': 'alice@gmail.com', 'company': 'Stripe'}, {'id': 3, 'name': 'Alice', 'email': 'alice@yahoo.com', 'company': 'Google'}, {'id': 4, 'name': 'Bob', 'email': 'bob@gmail.com', 'company': 'Stripe'}], {'name': 0.2, 'email': 0.5, 'company': 0.3}, 0.5, 1)
Expected Output: [2]
Explanation: Records 1 and 2 share email (0.5) and company (0.3) = 0.8 >= 0.5. Record 3 shares only name (0.2) and record 4 shares only company (0.3), both below 0.5.
Input: ([{'id': 1, 'name': 'Alice', 'email': 'alice@gmail.com', 'company': 'Stripe'}, {'id': 2, 'name': 'Alicia', 'email': 'alice@gmail.com', 'company': 'Stripe'}, {'id': 3, 'name': 'Alice', 'email': 'alice@yahoo.com', 'company': 'Google'}, {'id': 4, 'name': 'Bob', 'email': 'bob@gmail.com', 'company': 'Stripe'}], {'name': 0.2, 'email': 0.5, 'company': 0.3}, 0.2, 1)
Expected Output: [2, 3, 4]
Explanation: With threshold 0.2 every record sharing at least the name (3), company (4), or email+company (2) clears the bar.
Input: ([{'id': 1, 'name': 'Alice', 'email': 'alice@gmail.com', 'company': 'Stripe'}], {'name': 0.2, 'email': 0.5, 'company': 0.3}, 0.5, 1)
Expected Output: []
Explanation: Only the target exists, so there is nothing to link to.
Input: ([{'id': 10, 'name': 'X', 'email': 'x@a.com', 'company': 'C'}, {'id': 20, 'name': 'X', 'email': 'x@a.com', 'company': 'C'}], {'name': 0.2, 'email': 0.5, 'company': 0.3}, 1.0, 10)
Expected Output: [20]
Explanation: Threshold 1.0 requires all three fields to match; records 10 and 20 are identical so similarity is exactly 1.0.
Input: ([{'id': 1, 'name': 'A', 'email': 'a', 'company': 'c'}, {'id': 2, 'name': 'B', 'email': 'b', 'company': 'd'}], {'name': 0.2, 'email': 0.5, 'company': 0.3}, 0.0, 1)
Expected Output: [2]
Explanation: Threshold 0.0 links every other record regardless of shared fields (0 >= 0).
Hints
- Locate the target record first; if it is absent, return an empty list.
- Similarity only depends on exact-string equality per field, so you just sum the weights of the matching fields.
- Skip the target itself, and sort the resulting IDs before returning for a deterministic answer.
Find linked user records by similarity — Part 2: Up to one indirect hop
Constraints
- id values are unique integers.
- 0 <= threshold <= 1.
- Edges are undirected: i linked to j implies j linked to i.
- Reachability bound is exactly 2 edges (direct + one indirect hop).
- target_user_id is present in rows.
Examples
Input: ([{'id': 1, 'name': 'Alice', 'email': 'alice@gmail.com', 'company': 'Stripe'}, {'id': 2, 'name': 'Alicia', 'email': 'alice@gmail.com', 'company': 'Stripe'}, {'id': 3, 'name': 'Alicia', 'email': 'alice@yahoo.com', 'company': 'Stripe'}, {'id': 4, 'name': 'Bob', 'email': 'bob@x.com', 'company': 'Google'}], {'name': 0.2, 'email': 0.5, 'company': 0.3}, 0.5, 1)
Expected Output: [2, 3]
Explanation: 1-2 linked via email+company (0.8). 2-3 linked via name+company (0.5). 1-3 not directly linked (only company, 0.3). So 3 is reached through 2 in two hops; 4 is isolated.
Input: ([{'id': 1, 'name': 'Alice', 'email': 'alice@gmail.com', 'company': 'Stripe'}, {'id': 2, 'name': 'Alicia', 'email': 'alice@gmail.com', 'company': 'Stripe'}, {'id': 3, 'name': 'Alice', 'email': 'alice@yahoo.com', 'company': 'Google'}, {'id': 4, 'name': 'Bob', 'email': 'bob@gmail.com', 'company': 'Stripe'}], {'name': 0.2, 'email': 0.5, 'company': 0.3}, 0.5, 1)
Expected Output: [2]
Explanation: Only 1-2 clears the threshold and 2 has no further qualifying neighbors, so two-hop reach is just [2].
Input: ([{'id': 1, 'name': 'A', 'email': 'e1', 'company': 'X'}, {'id': 2, 'name': 'A', 'email': 'e2', 'company': 'X'}, {'id': 3, 'name': 'A', 'email': 'e3', 'company': 'X'}, {'id': 4, 'name': 'A', 'email': 'e4', 'company': 'X'}], {'name': 0.2, 'email': 0.5, 'company': 0.3}, 0.5, 1)
Expected Output: [2, 3, 4]
Explanation: Every pair shares name+company (0.5), so the graph is a clique; all other nodes are within one hop of the target.
Input: ([{'id': 1, 'name': 'A', 'email': 'e', 'company': 'c'}], {'name': 0.2, 'email': 0.5, 'company': 0.3}, 0.5, 1)
Expected Output: []
Explanation: Single record has no neighbors.
Input: ([{'id': 1, 'name': 'A', 'email': 'e1', 'company': 'c1'}, {'id': 2, 'name': 'B', 'email': 'e2', 'company': 'c2'}], {'name': 0.2, 'email': 0.5, 'company': 0.3}, 0.5, 1)
Expected Output: []
Explanation: The two records share nothing, so the target has no links at all.
Input: ([{'id': 1, 'name': 'n1', 'email': 'shared', 'company': 'c1'}, {'id': 2, 'name': 'n2', 'email': 'shared', 'company': 'c2'}, {'id': 3, 'name': 'n3', 'email': 'shared', 'company': 'c3'}, {'id': 4, 'name': 'n4', 'email': 'shared', 'company': 'c4'}, {'id': 5, 'name': 'lonely', 'email': 'lonely', 'company': 'lonely'}], {'name': 0.2, 'email': 0.5, 'company': 0.3}, 0.5, 1)
Expected Output: [2, 3, 4]
Explanation: Records 1-4 all share the same email (0.5) forming a clique; node 5 is isolated and unreachable.
Hints
- First materialize the direct-link graph by comparing all pairs once (similarity is symmetric).
- Collect the target's direct neighbors, then union in each direct neighbor's neighbors — that is precisely the set of nodes within 2 edges.
- Remember to drop the target id from the final set (it shows up as a neighbor of its own neighbors).
Find linked user records by similarity — Part 3: Full connected component
Constraints
- id values are unique integers.
- 0 <= threshold <= 1.
- Component membership is the transitive closure of the undirected direct-link relation.
- The result excludes target_user_id (it is reachable from itself by definition).
- target_user_id is present in rows.
Examples
Input: ([{'id': 1, 'name': 'a', 'email': 'shared1', 'company': 'x'}, {'id': 2, 'name': 'b', 'email': 'shared1', 'company': 'k'}, {'id': 3, 'name': 'b', 'email': 'zzz', 'company': 'k'}, {'id': 4, 'name': 'b', 'email': 'qqq', 'company': 'k'}, {'id': 5, 'name': 'lonely', 'email': 'solo', 'company': 'solo'}], {'name': 0.2, 'email': 0.5, 'company': 0.3}, 0.5, 1)
Expected Output: [2, 3, 4]
Explanation: 1-2 linked via email (0.5). 2-3, 2-4, 3-4 linked via name+company (0.5). 1 is NOT directly linked to 3 or 4 (no shared fields) yet they are in the same component, so transitive traversal returns [2,3,4]; node 5 is isolated.
Input: ([{'id': 1, 'name': 'Alice', 'email': 'alice@gmail.com', 'company': 'Stripe'}, {'id': 2, 'name': 'Alicia', 'email': 'alice@gmail.com', 'company': 'Stripe'}, {'id': 3, 'name': 'Alice', 'email': 'alice@yahoo.com', 'company': 'Google'}, {'id': 4, 'name': 'Bob', 'email': 'bob@gmail.com', 'company': 'Stripe'}], {'name': 0.2, 'email': 0.5, 'company': 0.3}, 0.5, 1)
Expected Output: [2]
Explanation: Only 1-2 are linked; that two-node component minus the target is [2].
Input: ([{'id': 1, 'name': 'A', 'email': 'e', 'company': 'c'}], {'name': 0.2, 'email': 0.5, 'company': 0.3}, 0.5, 1)
Expected Output: []
Explanation: Single record forms a component of size 1; excluding the target leaves nothing.
Input: ([{'id': 1, 'name': 'A', 'email': 'e1', 'company': 'c1'}, {'id': 2, 'name': 'B', 'email': 'e2', 'company': 'c2'}, {'id': 3, 'name': 'C', 'email': 'e3', 'company': 'c3'}], {'name': 0.2, 'email': 0.5, 'company': 0.3}, 0.5, 2)
Expected Output: []
Explanation: No two records share enough; the target's component is just itself.
Input: ([{'id': 1, 'name': 's', 'email': 'a', 'company': 'b'}, {'id': 2, 'name': 's', 'email': 'a', 'company': 'b'}, {'id': 3, 'name': 's', 'email': 'a', 'company': 'b'}], {'name': 0.2, 'email': 0.5, 'company': 0.3}, 1.0, 2)
Expected Output: [1, 3]
Explanation: All three records are identical (similarity 1.0 >= 1.0) so they form one component; from target 2 the rest is [1,3].
Input: ([{'id': 7, 'name': 'a', 'email': 'shared', 'company': 'c7'}, {'id': 8, 'name': 'b', 'email': 'shared', 'company': 'c8'}, {'id': 9, 'name': 'd', 'email': 'other', 'company': 'c9'}], {'name': 0.2, 'email': 0.5, 'company': 0.3}, 0.5, 9)
Expected Output: []
Explanation: Records 7 and 8 share email (a component of their own), but the target 9 shares nothing, so its component is empty after excluding itself.
Hints
- Build the same direct-link adjacency as Part 2.
- Run BFS or DFS from the target; any node you can reach belongs to its connected component.
- Mark nodes visited as you enqueue them to avoid revisiting in cyclic graphs, then drop the target from the final set.