You are given a directed referral graph where each user may have at most one referrer. Input file 'referrals.csv' has columns: user_id (INT), referred_by (INT, nullable). Assumptions: up to 1e6 users; referred_by may reference a user_id not present in the file; rows may contain duplicates; self-referrals and multi-node cycles may exist.
Write Python code to accomplish all of the following:
-
Implement chain(u) that returns the referral chain for user u from the earliest ancestor (root) to u as a list of user_ids. If any cycle is encountered on the path, detect it and return both: (a) the simple cycle nodes in encounter order, and (b) the acyclic prefix leading into the cycle; avoid infinite loops.
-
In O(n) time and O(n) memory overall, compute for every user: (root_ancestor[u], chain_depth[u]) where chain_depth[u] is the number of unique referrers on the path to the root. Do not recompute paths from scratch per user; use memoization/union-find/DFS with coloring (your choice) and justify complexity.
-
Return the top-3 longest valid acyclic chains as lists. Break ties by smaller root_ancestor id, then lexicographically smaller full chain list.
-
Robustness: describe (and implement) preprocessing to (a) deduplicate rows, keeping the earliest seen parent for a user if duplicates conflict, (b) normalize null/empty referred_by, (c) tolerate referred_by not found in user_id (treat as external root), and (d) flag self-referrals.
-
Provide minimal unit tests that cover: acyclic chain, self-cycle, and a 2-node cycle.
Tiny example (use exactly these rows for tests):
referrals.csv
user_id,referred_by
1,
2,1
3,1
4,2
5,4
6,5
7,7
8,9
9,8
Expected behaviors to assert (do not print answers here):
-
chain(6) traverses 1→2→4→5→6; depth(6)=4; root=1
-
chain(7) reports a self-cycle at 7
-
chain(8) reports the 2-node cycle {8,9}.