Case Study: DBSCAN Clustering Implementation
This chapter documents the complete EXTREME TDD implementation of aprender's DBSCAN clustering algorithm. This is a real-world example showing every phase of the RED-GREEN-REFACTOR cycle from Issue #14.
Background
GitHub Issue #14: Implement DBSCAN clustering algorithm
Requirements:
- Density-based clustering without requiring k specification
- Automatic outlier detection (noise points labeled as -1)
epsparameter for neighborhood radiusmin_samplesparameter for core point density threshold- Integration with
UnsupervisedEstimatortrait - Deterministic clustering results
- Comprehensive example demonstrating parameter effects
Initial State:
- Tests: 548 passing
- Existing clustering: K-Means only
- No density-based clustering support
CYCLE 1: Core DBSCAN Algorithm
RED Phase
Created 12 comprehensive tests in src/cluster/mod.rs:
#[test]
fn test_dbscan_new() {
let dbscan = DBSCAN::new(0.5, 3);
assert_eq!(dbscan.eps(), 0.5);
assert_eq!(dbscan.min_samples(), 3);
assert!(!dbscan.is_fitted());
}
#[test]
fn test_dbscan_fit_basic() {
let data = Matrix::from_vec(
6,
2,
vec![
1.0, 1.0, 1.1, 1.0, 1.0, 1.1,
5.0, 5.0, 5.1, 5.0, 5.0, 5.1,
],
)
.unwrap();
let mut dbscan = DBSCAN::new(0.5, 2);
dbscan.fit(&data).unwrap();
assert!(dbscan.is_fitted());
}
Additional tests covered:
- Cluster prediction consistency
- Noise detection for outliers
- Single cluster scenarios
- All-noise scenarios
- Parameter sensitivity (eps and min_samples)
- Reproducibility
- Error handling (predict before fit)
Verification:
$ cargo test dbscan
error[E0422]: cannot find struct `DBSCAN` in this scope
Result: 12 tests failing ✅ (expected - DBSCAN doesn't exist)
GREEN Phase
Implemented minimal DBSCAN algorithm in src/cluster/mod.rs:
/// DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct DBSCAN {
eps: f32,
min_samples: usize,
labels: Option<Vec<i32>>,
}
impl DBSCAN {
pub fn new(eps: f32, min_samples: usize) -> Self {
Self {
eps,
min_samples,
labels: None,
}
}
fn region_query(&self, x: &Matrix<f32>, i: usize) -> Vec<usize> {
let mut neighbors = Vec::new();
let n_samples = x.shape().0;
for j in 0..n_samples {
let dist = self.euclidean_distance(x, i, j);
if dist <= self.eps {
neighbors.push(j);
}
}
neighbors
}
fn euclidean_distance(&self, x: &Matrix<f32>, i: usize, j: usize) -> f32 {
let n_features = x.shape().1;
let mut sum = 0.0;
for k in 0..n_features {
let diff = x.get(i, k) - x.get(j, k);
sum += diff * diff;
}
sum.sqrt()
}
fn expand_cluster(
&self,
x: &Matrix<f32>,
labels: &mut [i32],
point: usize,
neighbors: &mut Vec<usize>,
cluster_id: i32,
) {
labels[point] = cluster_id;
let mut i = 0;
while i < neighbors.len() {
let neighbor = neighbors[i];
if labels[neighbor] == -2 {
labels[neighbor] = cluster_id;
let neighbor_neighbors = self.region_query(x, neighbor);
if neighbor_neighbors.len() >= self.min_samples {
for &nn in &neighbor_neighbors {
if !neighbors.contains(&nn) {
neighbors.push(nn);
}
}
}
} else if labels[neighbor] == -1 {
labels[neighbor] = cluster_id;
}
i += 1;
}
}
}
impl UnsupervisedEstimator for DBSCAN {
type Labels = Vec<i32>;
fn fit(&mut self, x: &Matrix<f32>) -> Result<()> {
let n_samples = x.shape().0;
let mut labels = vec![-2; n_samples]; // -2 = unlabeled
let mut cluster_id = 0;
for i in 0..n_samples {
if labels[i] != -2 {
continue;
}
let mut neighbors = self.region_query(x, i);
if neighbors.len() < self.min_samples {
labels[i] = -1;
continue;
}
self.expand_cluster(x, &mut labels, i, &mut neighbors, cluster_id);
cluster_id += 1;
}
self.labels = Some(labels);
Ok(())
}
fn predict(&self, _x: &Matrix<f32>) -> Self::Labels {
self.labels().clone()
}
}
Verification:
$ cargo test
test result: ok. 560 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Result: All 560 tests passing ✅ (12 new DBSCAN tests)
REFACTOR Phase
Code Quality:
- Fixed clippy warnings (unused variables, map_clone, manual_contains)
- Added comprehensive documentation
- Exported DBSCAN in prelude for easy access
- Added public getter methods (eps, min_samples, is_fitted, labels)
Verification:
$ cargo clippy --all-targets -- -D warnings
Finished `dev` profile [unoptimized + debuginfo] target(s) in 1.53s
Result: Zero clippy warnings ✅
Example Implementation
Created comprehensive example examples/dbscan_clustering.rs demonstrating:
- Standard DBSCAN clustering - Basic usage with 2 clusters and noise
- Effect of eps parameter - Shows how neighborhood radius affects clustering
- Effect of min_samples parameter - Demonstrates density threshold impact
- Comparison with K-Means - Highlights DBSCAN's outlier detection advantage
- Anomaly detection use case - Practical application for identifying outliers
Key differences from K-Means:
- K-Means: requires specifying k, assigns all points to clusters
- DBSCAN: discovers k automatically, identifies outliers as noise
Run the example:
cargo run --example dbscan_clustering
Algorithm Details
Time Complexity: O(n²) for naive distance computations Space Complexity: O(n) for storing labels
Core Concepts:
- Core points: Points with ≥ min_samples neighbors within eps
- Border points: Non-core points within eps of a core point
- Noise points: Points neither core nor border (labeled -1)
- Cluster expansion: Recursive growth from core points to reachable neighbors
Final State
Tests: 560 passing (548 → 560, +12 DBSCAN tests)
Coverage: All DBSCAN functionality comprehensively tested
Quality: Zero clippy warnings, full documentation
Exports: Available via use aprender::prelude::*;
Key Takeaways
- EXTREME TDD works: Tests written first caught edge cases early
- Algorithm correctness: Comprehensive tests verify all scenarios
- Quality gates: Clippy and formatting ensure consistent code style
- Documentation: Example demonstrates practical usage and parameter tuning