K-means Clustering: Concepts, Initialization, Model Selection, Preprocessing, and Business Validation
Context: You are clustering customer data with numeric features (e.g., RFM, engagement, product usage) to build marketing segments. Assume standard K-means (Euclidean distance) unless noted.
-
Explain K-means and its core assumptions.
(a) Compare random initialization vs. k-means++ and discuss their impact on convergence and solution quality.
(b) Pick two of the following methods for choosing K: silhouette score, the elbow method, or BIC. For each, explain how and why it can fail when clusters are non-spherical or differ in density or size.
(c) Given features on very different scales and the presence of outliers, propose concrete preprocessing steps to apply before running K-means.
(d) Describe how you would evaluate whether the clusters are useful for a marketing segmentation problem, including business-oriented post hoc validation beyond internal clustering metrics.
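
For orientation, the setup behind parts (a) and (c) can be sketched as follows. This is a minimal illustration using only the Python standard library, not a reference answer: the toy customer rows, the feature meanings (recency in days, annual spend), and all function names are hypothetical, and plain z-scoring stands in for whatever scaling and outlier handling an answer might actually propose.

```python
import math
import random


def standardize(rows):
    """Z-score each column so no single feature dominates Euclidean distance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: draw each new center with probability proportional
    to its squared distance from the nearest center chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd's algorithm with k-means++ seeding; returns labels, centers, inertia."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    inertia = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return labels, centers, inertia


# Hypothetical toy data: [recency_days, annual_spend] on very different scales.
customers = [[5, 5000], [7, 5200], [6, 4800], [90, 200], [95, 250], [88, 180]]
scaled = standardize(customers)
labels, centers, inertia = kmeans(scaled, k=2)
print(labels, round(inertia, 4))
```

Rerunning `kmeans` over a range of `k` values and comparing the returned inertia gives the raw input for an elbow plot in part (b). Swapping the seeding for uniformly random centers is one way to observe the initialization effect asked about in part (a).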