Clustering And Graph Community Detection
Asked of: Data Scientist
Last updated

What's being tested
Ability to choose, justify, and evaluate clustering/community-detection methods for graph-structured data under practical constraints: algorithmic tradeoffs (quality vs. scale), preprocessing, and metrics tied to business goals.
Core knowledge
- Modularity and conductance as primary graph-community quality metrics; modularity maximization pitfalls (resolution limit).
- Louvain/Leiden: fast, hierarchical modularity optimization suitable for large graphs; Leiden fixes disconnected communities.
- Label Propagation: near-linear-time, stochastic, good for baseline but noisy and unstable.
- Spectral clustering: eigenvector-based, good quality for small graphs; O(n^3) in naïve form, needs approximations.
- Community detection vs. node clustering: edges define communities; k-means on node features assumes Euclidean clusters.
- Graph embeddings (DeepWalk, node2vec, GraphSAGE) enable downstream clustering but require tuning walk/aggregation hyperparameters.
- Scalability: use distributed frameworks (GraphX, Giraph) or streaming/approximate algorithms and sample giant components carefully.
Worked example
Question: "Design a system to detect communities in a large social graph (hundreds of millions of nodes)." Start by clarifying objectives: offline vs. streaming, desired community size granularity, whether edges are weighted/directed, and SLAs. Define evaluation metrics (modularity, conductance, business KPIs like ad-targeting lift). For a billion-edge graph, propose a two-stage pipeline: lightweight graph preprocessing and sampling (remove low-degree noise, handle giant hubs), then run a scalable algorithm such as Leiden for batch detection, with label-propagation as a fast iterative refresher. Explain fallback: use node embeddings plus clustering when you need attribute-aware communities.
A common pitfall
Choosing an algorithm by name rather than by constraints: e.g., applying k-means to adjacency vectors or using spectral clustering at scale without approximations. This often produces spurious "communities" or is computationally infeasible. Also, optimizing modularity blindly yields merged small but meaningful groups (resolution limit); always tie evaluation back to offline metrics or business experiments.
Further reading
- Blondel et al., “Fast unfolding of communities in large networks” (Louvain)
- Fortunato, “Community detection in graphs” (Physics Reports, 2010)
Related concepts
- Clustered And Networked Experiments
- Graphs, Grids, And Connected ComponentsCoding & Algorithms
- Graph, Grid, And Connectivity AlgorithmsCoding & Algorithms
- Tree And Graph ConnectivityCoding & Algorithms
- Graph Algorithms, Dependency Resolution And ConnectivityCoding & Algorithms
- Topological Sorting And Cycle DetectionCoding & Algorithms