This question evaluates a data scientist's competency in K-means clustering, covering core algorithmic assumptions, initialization effects, methods for selecting K, preprocessing needs for scaling and outliers, and business-focused post hoc validation of segments.
Context: You are clustering customer data with numeric features (e.g., RFM, engagement, product usage) to build marketing segments. Assume standard K-means (Euclidean distance) unless noted.
(a) Compare random initialization vs. k-means++ and discuss their impact on convergence and solution quality.
(b) Describe two methods for choosing K, selected from the silhouette score, the elbow method, and BIC. Explain how and why each can fail when clusters are non-spherical or differ in density or size.
(c) Given feature scaling issues and outliers, propose concrete preprocessing steps before running K-means.
(d) Describe how you would evaluate whether the clusters are useful for a marketing segmentation problem, including business-oriented post hoc validation beyond internal clustering metrics.
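As a rough illustration of the mechanics behind parts (a) to (c), the sketch below runs scikit-learn's `KMeans` on synthetic data. It is one possible setup under stated assumptions, not a reference answer. The RFM-like data, the 1st/99th percentile clipping thresholds, the seed range, and the helper names `preprocess` and `inertia_by_init` are all hypothetical choices made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical RFM-like data: three synthetic groups with right-skewed values.
# Columns stand in for recency, frequency, and monetary value.
X_raw = np.vstack([
    rng.lognormal(mean=[1.0, 3.0, 6.0], sigma=0.3, size=(200, 3)),
    rng.lognormal(mean=[3.0, 1.5, 4.0], sigma=0.3, size=(200, 3)),
    rng.lognormal(mean=[4.5, 0.5, 2.5], sigma=0.3, size=(200, 3)),
])


def preprocess(X, lower=1, upper=99):
    """Part (c): reduce skew, cap outliers, then put features on one scale."""
    X_log = np.log1p(X)  # compress right skew (assumes non-negative inputs)
    lo, hi = np.percentile(X_log, [lower, upper], axis=0)
    X_clipped = np.clip(X_log, lo, hi)  # winsorize each feature at its percentiles
    return StandardScaler().fit_transform(X_clipped)  # zero mean, unit variance


def inertia_by_init(X, k, init, seeds=range(10)):
    """Part (a): final inertia from a single run per seed (n_init=1)."""
    return [
        KMeans(n_clusters=k, init=init, n_init=1, random_state=s).fit(X).inertia_
        for s in seeds
    ]


X = preprocess(X_raw)

# Compare the spread of final inertia across seeds for the two init schemes.
random_inertias = inertia_by_init(X, 3, "random")
kpp_inertias = inertia_by_init(X, 3, "k-means++")
print("random    min/max inertia:", min(random_inertias), max(random_inertias))
print("k-means++ min/max inertia:", min(kpp_inertias), max(kpp_inertias))

# Part (b): mean silhouette score for a range of candidate K values.
silhouettes = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit_predict(X)
    silhouettes[k] = silhouette_score(X, labels)

best_k = max(silhouettes, key=silhouettes.get)
print("silhouette by K:", silhouettes)
print("K with highest silhouette:", best_k)
```

Setting `n_init=1` in the comparison is deliberate: it exposes run-to-run variation that multiple restarts would otherwise hide. The silhouette loop only ranks candidate K values on this well-separated synthetic data, so it does not demonstrate the failure modes that part (b) asks about, and it says nothing about the business validation in part (d).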