Deduplication And Entity Resolution
Asked of: Data Scientist
Last updated

What's being tested
Ability to design, evaluate, and operationalize scalable record linkage: choosing matching features/algorithms, handling blocking/indexing tradeoffs, clustering matches, and measuring precision/recall under scale and privacy constraints.
Core knowledge
- Fellegi–Sunter probabilistic record linkage: match/non-match likelihood ratios and thresholding.
- String similarity: Levenshtein, Jaro–Winkler, TF-IDF + cosine for names/addresses.
- Blocking/indexing: sorted-neighborhood, canopy clustering, LSH/MinHash to reduce pair comparisons.
- Clustering & transitive closure: connected components, hierarchical agglomerative, graph-based dedupe.
- Learning-to-match: pairwise classifiers (logistic, GBDT) and active labeling for imbalanced labels.
- Evaluation: pairwise precision/recall/F1 and cluster-level metrics; calibration via ROC/PR and business-cost curves.
- Scale & infra: Spark joins, GraphFrames connectedComponents, blocking keys to limit shuffles.
- Privacy/PII: salted hashing, k-anonymity tradeoffs, differential privacy considerations.
Worked example — "Deduplicate user profiles across devices" (framing)
Start by defining the business match: what counts as "same person" (cross-device cookies, partial PII). Create a labeled sample via deterministic rules and human review for edge cases. Design blocking keys that balance recall (broad name+email hash) and pair reduction (device type, country). Engineer features: name/address similarity, shared IPs, device fingerprint overlap, temporal co-occurrence, and behavioral embeddings. Train a pairwise scorer, then cluster matches using graph connected components with a conservative threshold; validate using pairwise and cluster metrics and iterate blocking/model to meet recall/precision targets for downstream attribution.
A common pitfall
Candidates often optimize a pairwise threshold without addressing clustering/transitivity: two records A≈B and B≈C may be linked but A and C differ greatly, creating false clusters. Another trap is tuning for precision only (to avoid false merges) which destroys recall and biases downstream analytics. Also don't ignore blocking errors — missed blocks are irrecoverable unless reblocked or augmented with fallback fuzzy joins.
Further reading
- W. E. Winkler, "Overview of Record Linkage and Current Research Directions," 2006 (survey touching Fellegi–Sunter applications).
- C. Christen, "Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection," 2012 (practical algorithms and evaluation).