Normalize City Names for Vote Aggregation
Context
You have voting records containing a free-text city field. The same city may appear in many forms (e.g., "NYC", "New York", "New York City"), so each record must be mapped to a canonical city before votes can be aggregated reliably.
Task
Design an approach to cluster or normalize city-name variants into canonical entities so votes aggregate correctly.
Describe:
- A rules-based approach
  - Token normalization, abbreviation expansion, fuzzy matching, phonetic keys, blocking/candidate generation.
- A learning-based approach
  - Pairwise matching models and/or vector-embedding retrieval with re-ranking.
- Similarity threshold selection
  - How to set, calibrate, and operate with high-confidence auto-accept/auto-reject bands.
- Handling ambiguous names
  - e.g., multiple "Springfield" candidates.
- Evaluation and maintenance
  - Metrics, validation, human-in-the-loop, monitoring drift, and updating the mapping over time.
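As a rough illustration of the rules-based items and the auto-accept/auto-reject bands listed above, the following minimal sketch normalizes tokens, expands abbreviations, and scores a name against canonical names with the standard library's difflib. The abbreviation table, the canonical name list, the threshold values, and the function names are all hypothetical placeholders; in practice the thresholds would be calibrated on labeled pairs rather than chosen by hand.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real one would be built from gazetteer alternate names.
ABBREVIATIONS = {"nyc": "new york city", "st": "saint", "ft": "fort"}

# Hypothetical canonical names; a real list would come from the gazetteer.
CANONICAL = ["new york city", "los angeles", "saint louis", "fort worth"]

# Assumed band edges, for illustration only.
AUTO_ACCEPT = 0.90
AUTO_REJECT = 0.60


def normalize(raw):
    """Lowercase, strip accents and punctuation, and expand known abbreviations."""
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)


def best_match(raw):
    """Return (canonical_name_or_None, score, decision) for one raw city string."""
    key = normalize(raw)
    scored = [(SequenceMatcher(None, key, name).ratio(), name) for name in CANONICAL]
    score, name = max(scored)
    if score >= AUTO_ACCEPT:
        return name, score, "accept"
    if score < AUTO_REJECT:
        return None, score, "reject"
    # Scores between the two bands are routed to human review.
    return name, score, "review"


if __name__ == "__main__":
    for raw in ["NYC", "St. Louis", "New York", "Zzzz"]:
        print(raw, "->", best_match(raw))
```

With these assumed thresholds, "NYC" and "St. Louis" normalize to exact canonical names and are accepted, while "New York" scores in the middle band and goes to review. Scoring every canonical name is only workable for a small list; a blocking key (for example, the first token or a phonetic code) would limit the candidates on a real gazetteer.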
Assume you can use authoritative gazetteers (e.g., national census/OSM/GeoNames) that list canonical city IDs, names, alternative names, and geographies (state/county/country), and that some contextual fields (e.g., state, ZIP) may be present in the voting data.
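One way to picture how the contextual fields help with ambiguous names is the sketch below, which resolves a name against a tiny in-memory gazetteer and uses the state, when present, to pick among several "Springfield" entries. The gazetteer rows, the ID format, the record layout, and the function names are hypothetical and chosen only for illustration; records that remain ambiguous or unknown are set aside rather than guessed.

```python
from collections import Counter, defaultdict

# Hypothetical gazetteer rows: (canonical_id, normalized_name, state).
GAZETTEER = [
    ("US-IL-SPRINGFIELD", "springfield", "IL"),
    ("US-MA-SPRINGFIELD", "springfield", "MA"),
    ("US-MO-SPRINGFIELD", "springfield", "MO"),
    ("US-NY-NEWYORK", "new york city", "NY"),
]

# Index by name so each lookup only sees candidates sharing that name.
BY_NAME = defaultdict(list)
for city_id, name, state in GAZETTEER:
    BY_NAME[name].append((city_id, state))


def resolve(name, state=None):
    """Return a canonical ID, or None when the name is unknown or still ambiguous."""
    candidates = BY_NAME.get(name.strip().lower(), [])
    if state is not None:
        candidates = [c for c in candidates if c[1] == state.strip().upper()]
    if len(candidates) == 1:
        return candidates[0][0]
    return None


def aggregate(records):
    """Count votes per canonical ID; collect unresolved records for review."""
    totals = Counter()
    unresolved = []
    for rec in records:
        city_id = resolve(rec["city"], rec.get("state"))
        if city_id is None:
            unresolved.append(rec)
        else:
            totals[city_id] += 1
    return totals, unresolved


if __name__ == "__main__":
    votes = [
        {"city": "Springfield", "state": "il"},
        {"city": "Springfield"},  # no state: stays ambiguous
        {"city": "New York City"},
        {"city": " springfield ", "state": "MA"},
    ]
    totals, unresolved = aggregate(votes)
    print(dict(totals))
    print(unresolved)
```

In this sketch a record whose state conflicts with every candidate is also left unresolved, which keeps a possibly wrong context field from silently assigning votes to the wrong city.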