PracHub
QuestionsPremiumLearningGuidesCheatsheetNEWCoaches
|Home/Machine Learning/Microsoft

Cluster city name variants into canonical entities

Last updated: Mar 29, 2026

Quick Overview

This question evaluates entity-resolution and record-linkage competencies, focusing on normalization, clustering, and the design of hybrid rule-based and learning-based approaches for mapping messy city-name variants to canonical geographic entities.

  • medium
  • Microsoft
  • Machine Learning
  • Software Engineer

Cluster city name variants into canonical entities

Company: Microsoft

Role: Software Engineer

Category: Machine Learning

Difficulty: medium

Interview Round: Technical Screen

City names in the voting data can appear in many forms (e.g., "NYC", "New York", "New York City"). Design an approach to cluster or normalize these variants into canonical entities so votes aggregate correctly. Describe rules-based and learning-based methods (e.g., token normalization, fuzzy matching, phonetic keys, vector embeddings), how to select similarity thresholds, handle ambiguous names (e.g., "Springfield"), and evaluate and maintain the mapping over time.

Quick Answer: This question evaluates entity-resolution and record-linkage competencies, focusing on normalization, clustering, and the design of hybrid rule-based and learning-based approaches for mapping messy city-name variants to canonical geographic entities.

Related Interview Questions

  • How do you choose a model? - Microsoft (medium)
  • Explain SHAP in an ML System - Microsoft (medium)
  • Explain normalization, regularization, CTR, imbalance handling - Microsoft (medium)
  • Clean OCR data and build an LLM dataset - Microsoft (medium)
  • Explain SHAP and build an ML project - Microsoft (easy)
Microsoft logo
Microsoft
Sep 6, 2025, 12:00 AM
Software Engineer
Technical Screen
Machine Learning
3
0

Normalize City Names for Vote Aggregation

Context

You have voting records containing a free-text city field. The same city may appear in many forms (e.g., "NYC", "New York", "New York City"), and you must aggregate votes by canonical city reliably.

Task

Design an approach to cluster or normalize city-name variants into canonical entities so votes aggregate correctly.

Describe:

  1. A rules-based approach
    • Token normalization, abbreviation expansion, fuzzy matching, phonetic keys, blocking/candidate generation.
  2. A learning-based approach
    • Pairwise matching models and/or vector-embedding retrieval with re-ranking.
  3. Similarity threshold selection
    • How to set, calibrate, and operate with high-confidence auto-accept/auto-reject bands.
  4. Handling ambiguous names
    • e.g., multiple "Springfield" candidates.
  5. Evaluation and maintenance
    • Metrics, validation, human-in-the-loop, monitoring drift, and updating the mapping over time.

Assume you can use authoritative gazetteers (e.g., national census/OSM/GeoNames) that list canonical city IDs, names, alternative names, and geographies (state/county/country), and that some contextual fields (e.g., state, ZIP) may be present in the voting data.

Solution

Show

Comments (0)

Sign in to leave a comment

Loading comments...

Browse More Questions

More Machine Learning•More Microsoft•More Software Engineer•Microsoft Software Engineer•Microsoft Machine Learning•Software Engineer Machine Learning
PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.