Choose clustering vs regression; explain KNN
Company: Thumbtack
Role: Data Scientist
Category: Machine Learning
Difficulty: Medium
Interview Round: Onsite
When would you use clustering vs. regression on a business problem with partially labeled outcomes? Specify the decision criteria (label availability, objective, evaluation metrics, cost of errors). Enumerate at least four clustering algorithms (K-Means, Hierarchical/Agglomerative, DBSCAN/HDBSCAN, Gaussian Mixture Models) and compare assumptions, key hyperparameters, scalability, distance metrics, and failure modes (e.g., non-spherical clusters, varying density, high-dimensional sparsity, mixed data types). Give concrete scenarios selecting DBSCAN over K-Means and vice versa. Finally, explain K-Nearest Neighbors to a non-technical stakeholder with a real-world analogy, then deepen: choosing k, weighting by distance, effects of feature scaling, curse of dimensionality, and how to deploy KNN efficiently (KD-tree/ball-tree, approximate neighbors).
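One scenario the question asks for, choosing DBSCAN over K-Means, can be sketched with two tight groups of points plus a single isolated outlier. The code below is a simplified, brute-force DBSCAN written only for illustration, not a library implementation; the point coordinates and the `eps` and `min_samples` values are made-up assumptions.

```python
from math import dist  # Euclidean distance, Python 3.8+


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Brute-force radius query, O(n) per point; the result includes i itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster  # i is a core point: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_samples:
                queue.extend(nbrs)  # j is also a core point: keep growing
        cluster += 1
    return labels


# Hypothetical data: two dense 2x2 squares and one far-away point.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
    (5.0, 20.0),
]
labels = dbscan(points, eps=1.5, min_samples=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

K-Means with k = 2 has no noise label, so it would have to assign the isolated point to one of the two centroids and shift that centroid toward it; DBSCAN marks the point as -1 instead. In the opposite direction, K-Means is the more natural choice when clusters are roughly spherical and similar in size, the number of segments is fixed in advance, and the dataset is large.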
Quick Answer: This question evaluates how a Data Scientist chooses between clustering and regression when outcomes are only partially labeled. It covers familiarity with clustering algorithms (K-Means, Hierarchical/Agglomerative, DBSCAN/HDBSCAN, Gaussian Mixture Models), the behavior of K-Nearest Neighbors, evaluation metrics, hyperparameters, and deployment considerations. It is commonly asked to assess judgment on supervised versus unsupervised approaches: the trade-offs driven by label availability, objective, and cost of errors, along with algorithmic assumptions, scalability, and practical deployment techniques. It tests both conceptual understanding and practical application.
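The KNN points on feature scaling and distance weighting can be made concrete with a small sketch. It assumes a hypothetical dataset of (job price in dollars, customer rating) pairs labeled `repeat` or `one_off`; the data, the labels, and the helper names `fit_scaler`, `scale`, and `knn_predict` are all invented for illustration. The search is brute force, not a KD-tree or ball-tree.

```python
from collections import defaultdict
from math import dist
from statistics import mean, pstdev


def fit_scaler(rows):
    """Return per-feature (mean, std) computed on the training rows only."""
    return [(mean(col), pstdev(col) or 1.0) for col in zip(*rows)]


def scale(row, stats):
    """Standardize one row: subtract the mean, divide by the std."""
    return tuple((x - m) / s for x, (m, s) in zip(row, stats))


def knn_predict(train_X, train_y, query, k=3, eps=1e-9):
    """Distance-weighted vote among the k nearest training points."""
    nearest = sorted((dist(x, query), y) for x, y in zip(train_X, train_y))[:k]
    votes = defaultdict(float)
    for d, label in nearest:
        votes[label] += 1.0 / (d + eps)  # closer neighbors count for more
    return max(votes, key=votes.get)


# Hypothetical training data: the label tracks the rating, not the price.
X = [(100, 4.8), (900, 4.9), (500, 4.7), (120, 2.0), (880, 2.2), (520, 1.9)]
y = ["repeat", "repeat", "repeat", "one_off", "one_off", "one_off"]
query = (130, 4.6)

# Unscaled: price differences (hundreds) swamp rating differences (a few units).
raw_prediction = knn_predict(X, y, query)

# Scaled: both features contribute on a comparable footing.
stats = fit_scaler(X)
X_scaled = [scale(row, stats) for row in X]
scaled_prediction = knn_predict(X_scaled, y, scale(query, stats))

print(raw_prediction, scaled_prediction)  # one_off repeat
```

Without scaling, the nearest neighbor is the low-rated job with a similar price, so the weighted vote goes to `one_off`; after standardization the high-rated neighbors dominate and the prediction flips to `repeat`. Brute-force search costs O(n) distance computations per query. For larger datasets, tree-based indexes (scikit-learn's `KNeighborsClassifier` accepts `algorithm="kd_tree"` or `"ball_tree"`, and `weights="distance"`) or approximate nearest-neighbor libraries reduce query time, though tree indexes lose much of their advantage as dimensionality grows.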