Deduplication And Entity Resolution — Tech Interview Concept

What's being tested
Ability to design, evaluate, and operationalize scalable record linkage: choosing matching features/algorithms, handling blocking/indexing tradeoffs, clustering matches, and measuring precision/recall under scale and privacy constraints.

Core knowledge

Fellegi–Sunter probabilistic record linkage: match/non-match likelihood ratios and thresholding.
String similarity: Levenshtein, Jaro–Winkler, TF-IDF + cosine for names/addresses.
Blocking/indexing: sorted-neighborhood, canopy clustering, LSH/MinHash to reduce pair comparisons.
Clustering & transitive closure: connected components, hierarchical agglomerative, graph-based dedupe.
Learning-to-match: pairwise classifiers (logistic, GBDT) and active labeling for imbalanced labels.
Evaluation: pairwise precision/recall/F1 and cluster-level metrics; calibration via ROC/PR and business-cost curves.
Scale & infra: Spark joins, GraphFrames connectedComponents, blocking keys to limit shuffles.
Privacy/PII: salted hashing, k-anonymity tradeoffs, differential privacy considerations.

Worked example — "Deduplicate user profiles across devices" (framing)
Start by defining the business match: what counts as "same person" (cross-device cookies, partial PII). Create a labeled sample via deterministic rules and human review for edge cases. Design blocking keys that balance recall (broad name+email hash) and pair reduction (device type, country). Engineer features: name/address similarity, shared IPs, device fingerprint overlap, temporal co-occurrence, and behavioral embeddings. Train a pairwise scorer, then cluster matches using graph connected components with a conservative threshold; validate using pairwise and cluster metrics and iterate blocking/model to meet recall/precision targets for downstream attribution.

A common pitfall
Candidates often optimize a pairwise threshold without addressing clustering/transitivity: two records A≈B and B≈C may be linked but A and C differ greatly, creating false clusters. Another trap is tuning for precision only (to avoid false merges) which destroys recall and biases downstream analytics. Also don't ignore blocking errors — missed blocks are irrecoverable unless reblocked or augmented with fallback fuzzy joins.

Related concepts