Which clustering algorithm would you use, and why?
Company: Meta
Role: Data Scientist
Category: Machine Learning
Difficulty: medium
Interview Round: Technical Screen
##### Question
You need to cluster users of a social product (e.g. one of Meta's apps) to discover meaningful groups such as communities, interest groups, or usage segments. You may have either or both of the following data sources:
- A **user feature table** — dense numeric/categorical features per user (age bucket, country, activity rate, topics engaged, embeddings, etc.).
- A **social network graph** — nodes = users, edges = friendships / follows / messages / interactions, possibly **weighted and directed**.
Answer the following:
1. **Traditional (feature-vector) clustering.** Which clustering algorithms would you consider (e.g. k-means, GMM, hierarchical, DBSCAN/HDBSCAN) and how would you choose among them? Describe preprocessing, distance/similarity choices, how you would pick the number of clusters, and how you would evaluate cluster quality.
2. **Social network / graph clustering.** If the core data is a social graph instead, what algorithms would you use for community detection, and how does this differ fundamentally from clustering a feature matrix?
3. **Directed and weighted graphs.** How do you handle direction and edge weights in graph clustering?
4. **Hybrid.** How would you combine graph structure and user features when both are available?
5. **Choosing the number of clusters and evaluating quality.** What metrics and validation strategy would you use for both the feature-vector and the graph case?
6. **Scale and operations.** What practical issues arise at millions of users (compute, dynamic graphs, cold-start, drift) and how would you handle them?
Quick Answer: This Meta Data Scientist machine learning screen asks you to choose a clustering algorithm for social-product users. It contrasts traditional feature-vector clustering (k-means, GMM, hierarchical, DBSCAN/HDBSCAN) with social-graph community detection (Louvain/Leiden, spectral clustering, stochastic block models, node embeddings), and covers preprocessing, choosing the number of clusters, evaluation, directed/weighted graphs, and scaling to millions of users.
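
One way to make the directed/weighted part concrete is modularity, the objective that Louvain and Leiden optimize. The sketch below is a minimal pure-Python illustration, not a production implementation: it scores a given partition using the directed, weighted form of modularity, Q = Σ_c [W_c / m − (S_c^out · S_c^in) / m²], where m is the total edge weight, W_c is the weight of edges inside community c, and S_c^out / S_c^in are the summed out- and in-strengths of its nodes. The function name `directed_modularity`, the edge-list format, and the toy graph are assumptions made for illustration.

```python
from collections import defaultdict


def directed_modularity(edges, community):
    """Directed, weighted modularity of a partition.

    edges: iterable of (src, dst, weight) tuples (hypothetical input format).
    community: dict mapping every node to its community label.
    """
    total_weight = 0.0
    within = defaultdict(float)        # W_c: weight of edges inside community c
    out_strength = defaultdict(float)  # S_c^out: weight leaving nodes of c
    in_strength = defaultdict(float)   # S_c^in: weight entering nodes of c

    for src, dst, weight in edges:
        total_weight += weight
        out_strength[community[src]] += weight
        in_strength[community[dst]] += weight
        if community[src] == community[dst]:
            within[community[src]] += weight

    if total_weight == 0:
        return 0.0

    # Observed within-community weight minus what a random graph with the
    # same in/out strengths would be expected to place there.
    return sum(
        within[c] / total_weight
        - out_strength[c] * in_strength[c] / total_weight ** 2
        for c in set(community.values())
    )


# Toy graph (made up): two directed triangles joined by one weak edge.
edges = [
    ("a", "b", 1.0), ("b", "c", 1.0), ("c", "a", 1.0),
    ("d", "e", 1.0), ("e", "f", 1.0), ("f", "d", 1.0),
    ("c", "d", 0.1),
]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in split}

q_split = directed_modularity(edges, split)
q_merged = directed_modularity(edges, merged)
print(f"two communities: {q_split:.3f}, one community: {q_merged:.3f}")
```

Splitting the toy graph into its two triangles scores well above the single-community partition, which scores zero by construction. Because modularity depends on the number and size of communities, it is best used to compare partitions of the same graph, and it says nothing about whether the communities are meaningful for the product; that still needs the stability and downstream checks the question asks about.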