# Case Study: Local Outlier Factor (LOF) Implementation
This chapter documents the complete EXTREME TDD implementation of aprender's Local Outlier Factor algorithm for density-based anomaly detection (GitHub Issue #20).
## Background
**GitHub Issue #20**: Implement Local Outlier Factor (LOF) for Anomaly Detection

**Requirements:**
- Density-based anomaly detection using local reachability density
- Detects outliers in varying density regions
- Parameters: `n_neighbors`, `contamination`
- Methods: `fit()`, `predict()`, `score_samples()`, `negative_outlier_factor()`
- LOF score interpretation: ≈1 = normal, >>1 = outlier
**Initial State:**
- Tests: 612 passing
- Existing anomaly detection: Isolation Forest
- No density-based anomaly detection
## Implementation Summary

### RED Phase
Created 16 comprehensive tests covering:
- Constructor and basic fitting
- LOF score calculation (higher = more anomalous)
- Anomaly prediction (1=normal, -1=anomaly)
- Contamination parameter (10%, 20%, 30%)
- n_neighbors parameter (local vs global context)
- Varying density clusters (key LOF advantage)
- `negative_outlier_factor()` for sklearn compatibility
- Error handling (predict/score before fit)
- Multidimensional data
- Edge cases (all normal points)
### GREEN Phase

Implemented the complete LOF algorithm (352 lines):
**Core Components:**

1. `LocalOutlierFactor`: public API
   - Builder pattern (`with_n_neighbors`, `with_contamination`)
   - `fit`/`predict`/`score_samples` methods
   - `negative_outlier_factor` for sklearn compatibility
2. k-NN search (`compute_knn`):
   - Brute-force distance computation
   - Sort by distance
   - Extract the k nearest neighbors
3. Reachability distance (`reachability_distance`):
   - max(distance(A, B), k-distance(B))
   - Smooths density estimation
4. Local reachability density (`compute_lrd`):
   - LRD(A) = k / Σ(reachability_distance(A, neighbor))
   - Inverse of the average reachability distance
5. LOF score (`compute_lof_scores`):
   - LOF(A) = avg(LRD(neighbors)) / LRD(A)
   - Ratio of neighbor density to point density
**Key Algorithm Steps:**

1. **Fit:**
   - Compute k-NN for all training points
   - Compute LRD for all points
   - Compute LOF scores
   - Determine the threshold from contamination
2. **Predict:**
   - Compute k-NN for query points against the training set
   - Compute LRD for the query points
   - Compute LOF scores for the query points
   - Apply the threshold: LOF > threshold → anomaly
### REFACTOR Phase
- Removed unused variables
- Zero clippy warnings
- Exported in prelude
- Comprehensive documentation
- Varying density example showcasing LOF's key advantage
**Final State:**
- Tests: 612 → 628 (+16)
- Zero warnings
- All quality gates passing
## Algorithm Details

**Local Outlier Factor:**
- Compares local density to neighbors' densities
- Key advantage: Works with varying density regions
- LOF score interpretation:
  - LOF ≈ 1: density similar to neighbors (normal)
  - LOF >> 1: much lower density than neighbors (outlier)
  - LOF < 1: higher density than neighbors (core point)
**Time Complexity:** O(n² log k)
- n = samples, k = n_neighbors
- Dominated by the brute-force k-NN search

**Space Complexity:** O(n²)
- Distance matrix and k-NN storage
**Reachability Distance:**

reach_dist(A, B) = max(dist(A, B), k_dist(B))

where k_dist(B) is the distance from B to its k-th nearest neighbor.

**Local Reachability Density:**

LRD(A) = k / Σ_i reach_dist(A, neighbor_i)

**LOF Score:**

LOF(A) = (Σ_i LRD(neighbor_i)) / (k * LRD(A))
## Parameters

- `n_neighbors` (default: 20): number of neighbors used for density estimation
  - Smaller k: more local, more sensitive to local outliers
  - Larger k: more global context, more stable, but may miss local anomalies
- `contamination` (default: 0.1): expected proportion of anomalies
  - Range: 0.0 to 0.5
  - Sets the classification threshold
## Example Highlights
The example demonstrates:
- Basic anomaly detection
- LOF score interpretation (≈1 vs >>1)
- Varying density clusters (LOF's key advantage)
- n_neighbors parameter effects
- Contamination parameter
- LOF vs Isolation Forest comparison
- `negative_outlier_factor()` for sklearn compatibility
- Reproducibility
## Key Takeaways

- **Density-Based**: LOF compares local densities rather than global isolation
- **Varying Density**: excels where clusters have different densities
- **Interpretable Scores**: a LOF score is a density ratio with a clear meaning
- **Local Context**: `n_neighbors` controls how local the comparison is
- **Complementary**: works well alongside Isolation Forest
- **Relative Densities**: because scores are ratios of local densities, no single global density threshold is assumed
## Comparison: LOF vs Isolation Forest
| Feature | LOF | Isolation Forest |
|---|---|---|
| Approach | Density-based | Isolation-based |
| Varying Density | Excellent | Good |
| Global Outliers | Good | Excellent |
| Training Time | O(n²) | O(n log m) |
| Parameter Tuning | n_neighbors | n_estimators, max_samples |
| Interpretability | High (density ratio) | Medium (path length) |
**When to use LOF:**
- The data has regions with different densities
- Local outliers need to be detected
- Interpretable density-based scores are wanted

**When to use Isolation Forest:**
- Large datasets (faster training)
- Global outliers matter more
- The density structure is unknown

**Best practice:** use both and ensemble the results!
## Use Cases

- **Fraud Detection**: transactions with unusual patterns relative to a user's history
- **Network Security**: anomalous traffic under varying load conditions
- **Manufacturing**: defects at varying production speeds
- **Sensor Networks**: faulty sensors under varying environmental conditions
- **Medical Diagnosis**: unusual patient metrics relative to a demographic group
## Testing Strategy

**Unit Tests** (16 implemented):
- Correctness: Detects clear outliers in varying densities
- API contracts: Panic before fit, return expected types
- Parameters: n_neighbors, contamination effects
- Edge cases: All normal, all anomalous, small k
**Property-Based Tests** (future work):
- LOF ≈ 1 for uniform density
- LOF monotonic in isolation degree
- Consistency: Same k → consistent relative ordering