Case Study: Spectral Clustering Implementation
This chapter documents the complete EXTREME TDD implementation of aprender's Spectral Clustering algorithm for graph-based clustering from Issue #19.
Background
GitHub Issue #19: Implement Spectral Clustering for Non-Convex Clustering
Requirements:
- Graph-based clustering using eigendecomposition
- Affinity matrix construction (RBF and k-NN)
- Normalized graph Laplacian
- Eigendecomposition for embedding
- K-Means clustering in eigenspace
- Parameters: n_clusters, affinity, gamma, n_neighbors
Initial State:
- Tests: 628 passing
- Existing clustering: K-Means, DBSCAN, Hierarchical, GMM
- No graph-based clustering
Implementation Summary
RED Phase
Created 12 comprehensive tests covering:
- Constructor and basic fitting
- Predict method and labels consistency
- Non-convex cluster shapes (moon-shaped clusters)
- RBF affinity matrix
- K-NN affinity matrix
- Gamma parameter effects
- Multiple clusters (3 clusters)
- Error handling (predict before fit)
GREEN Phase
Implemented complete Spectral Clustering algorithm (352 lines):
Core Components:
-
Affinity Enum: RBF (Gaussian kernel) and KNN (k-nearest neighbors graph)
-
SpectralClustering: Public API with builder pattern
- with_affinity, with_gamma, with_n_neighbors
- fit/predict/is_fitted methods
-
Affinity Matrix Construction:
- RBF:
W[i,j] = exp(-gamma * ||x_i - x_j||^2) - K-NN: Connect each point to k nearest neighbors, symmetrize
- RBF:
-
Graph Laplacian: Normalized Laplacian
L = I - D^(-1/2) * W * D^(-1/2)- D is degree matrix (diagonal)
- Provides better numerical properties than unnormalized Laplacian
-
Eigendecomposition (
compute_embedding):- Extract k smallest eigenvectors using nalgebra
- Sort eigenvalues to find smallest k
- Build embedding matrix in row-major order
-
Row Normalization: Critical for normalized spectral clustering
- Normalize each row of embedding to unit length
- Improves cluster separation in eigenspace
-
K-Means Clustering: Final clustering in eigenspace
Key Algorithm Steps:
1. Construct affinity matrix W (RBF or k-NN)
2. Compute degree matrix D
3. Compute normalized Laplacian L = I - D^(-1/2) * W * D^(-1/2)
4. Find k smallest eigenvectors of L
5. Normalize rows of eigenvector matrix
6. Apply K-Means clustering in eigenspace
REFACTOR Phase
- Fixed unnecessary type cast warning
- Zero clippy warnings
- Exported Affinity and SpectralClustering in prelude
- Comprehensive documentation
- Example demonstrating RBF vs K-NN affinity
Final State:
- Tests: 628 → 640 (+12)
- Zero warnings
- All quality gates passing
Algorithm Details
Spectral Clustering:
- Uses graph theory to find clusters
- Analyzes spectrum (eigenvalues) of graph Laplacian
- Effective for non-convex cluster shapes
- Based on graph cut optimization
Time Complexity: O(n² + n³)
- O(n²) for affinity matrix construction
- O(n³) for eigendecomposition
- Dominated by eigendecomposition
Space Complexity: O(n²)
- Affinity matrix storage
- Laplacian matrix storage
RBF Affinity:
W[i,j] = exp(-gamma * ||x_i - x_j||^2)
- Gamma controls locality (higher = more local)
- Full connectivity (dense graph)
- Good for globular clusters
K-NN Affinity:
W[i,j] = 1 if j in k-NN(i), 0 otherwise
Symmetrize: W[i,j] = max(W[i,j], W[j,i])
- Sparse connectivity
- Better for non-convex shapes
- Parameter k controls graph density
Normalized Graph Laplacian:
L = I - D^(-1/2) * W * D^(-1/2)
Where D is the degree matrix (diagonal, D[i,i] = sum of row i of W).
Parameters
-
n_clusters (required): Number of clusters to find
-
affinity (default: RBF): Affinity matrix type
- RBF: Gaussian kernel, good for globular clusters
- KNN: k-nearest neighbors, good for non-convex shapes
-
gamma (default: 1.0): RBF kernel coefficient
- Higher gamma: More local similarity
- Lower gamma: More global similarity
- Only used for RBF affinity
-
n_neighbors (default: 10): Number of neighbors for k-NN graph
- Smaller k: Sparser graph, more clusters
- Larger k: Denser graph, fewer clusters
- Only used for KNN affinity
Example Highlights
The example demonstrates:
- Basic RBF affinity clustering
- K-NN affinity for chain-like clusters
- Gamma parameter effects (0.1, 1.0, 5.0)
- Multiple clusters (k=3)
- Spectral Clustering vs K-Means comparison
- Affinity matrix interpretation
Key Takeaways
- Graph-Based: Uses graph theory and eigendecomposition
- Non-Convex: Handles non-convex cluster shapes better than K-Means
- Affinity Choice: RBF for globular, K-NN for non-convex
- Row Normalization: Critical step after eigendecomposition
- Eigenvalue Sorting: Must sort eigenvalues to find smallest k
- Computational Cost: O(n³) eigendecomposition limits scalability
Comparison: Spectral vs K-Means
| Feature | Spectral Clustering | K-Means |
|---|---|---|
| Cluster Shape | Non-convex, arbitrary | Convex, spherical |
| Complexity | O(n³) | O(nki) |
| Scalability | Small to medium | Large datasets |
| Parameters | n_clusters, affinity, gamma/k | n_clusters, max_iter |
| Graph Structure | Yes (via affinity) | No |
| Initialization | Deterministic (eigenvectors) | Random (k-means++) |
When to use Spectral Clustering:
- Data has non-convex cluster shapes
- Clusters have varying densities
- Data has graph structure
- Dataset is small-to-medium sized
When to use K-Means:
- Clusters are roughly spherical
- Dataset is large (millions of points)
- Speed is critical
- Cluster sizes are similar
Use Cases
- Image Segmentation: Segment images by pixel similarity
- Social Network Analysis: Find communities in social graphs
- Document Clustering: Group documents by content similarity
- Gene Expression Analysis: Cluster genes with similar expression patterns
- Anomaly Detection: Identify outliers via cluster membership
Testing Strategy
Unit Tests (12 implemented):
- Correctness: Separates well-separated clusters
- API contracts: Panic before fit, return expected types
- Parameters: affinity, gamma, n_neighbors effects
- Edge cases: Multiple clusters, non-convex shapes
Property-Based Tests (future work):
- Connected components: k eigenvalues near 0 → k clusters
- Affinity symmetry: W[i,j] = W[j,i]
- Laplacian positive semi-definite
Technical Challenges Solved
Challenge 1: Eigenvalue Ordering
Problem: nalgebra's SymmetricEigen doesn't sort eigenvalues. Solution: Manual sorting of eigenvalue-index pairs, take k smallest indices.
Challenge 2: Row-Major vs Column-Major
Problem: Embedding matrix constructed in column-major order but Matrix expects row-major. Solution: Iterate rows first, then columns when extracting eigenvectors.
Challenge 3: Row Normalization
Problem: Without row normalization, clustering quality was poor. Solution: Normalize each row of embedding matrix to unit length.
Challenge 4: Concentric Circles
Problem: Original test used concentric circles, fundamentally challenging for spectral clustering. Solution: Replaced with more realistic moon-shaped clusters.
Related Topics
References
- Ng, A. Y., Jordan, M. I., & Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. NIPS.
- Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and computing, 17(4), 395-416.
- Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE TPAMI.