Case Study: Local Outlier Factor (LOF) Implementation

This chapter documents the complete EXTREME TDD implementation of aprender's Local Outlier Factor (LOF) algorithm for density-based anomaly detection, as specified in GitHub Issue #20.

Background

GitHub Issue #20: Implement Local Outlier Factor (LOF) for Anomaly Detection

Requirements:

  • Density-based anomaly detection using local reachability density
  • Detects outliers in varying density regions
  • Parameters: n_neighbors, contamination
  • Methods: fit(), predict(), score_samples(), negative_outlier_factor()
  • LOF score interpretation: ≈1 = normal, >>1 = outlier
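
Taken together, the requirements imply roughly this calling sequence. A minimal sketch, assuming a constructor `new()` and a `Matrix::from_rows` input helper (only the method and builder names come from the issue):

```rust
use aprender::prelude::*; // the REFACTOR phase exports LocalOutlierFactor here

fn main() {
    // NOTE: `Matrix::from_rows`, `new()`, and the return types shown are
    // illustrative assumptions; only the method names come from the issue.
    let data = Matrix::from_rows(&[
        [0.0, 0.0],
        [0.1, 0.1],
        [0.0, 0.1],
        [5.0, 5.0], // isolated point
    ]);

    let mut lof = LocalOutlierFactor::new()
        .with_n_neighbors(2)
        .with_contamination(0.25);
    lof.fit(&data);

    let labels = lof.predict(&data);         // 1 = normal, -1 = anomaly
    let scores = lof.score_samples(&data);   // ≈1 = normal, >>1 = outlier
    let nof = lof.negative_outlier_factor(); // sklearn-compatible sign
    println!("{labels:?} {scores:?} {nof:?}");
}
```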

Initial State:

  • Tests: 612 passing
  • Existing anomaly detection: Isolation Forest
  • No density-based anomaly detection

Implementation Summary

RED Phase

Created 16 comprehensive tests covering:

  • Constructor and basic fitting
  • LOF score calculation (higher = more anomalous)
  • Anomaly prediction (1=normal, -1=anomaly)
  • Contamination parameter (10%, 20%, 30%)
  • n_neighbors parameter (local vs global context)
  • Varying density clusters (key LOF advantage)
  • negative_outlier_factor() for sklearn compatibility
  • Error handling (predict/score before fit)
  • Multidimensional data
  • Edge cases (all normal points)
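
For flavor, the score-interpretation test might have looked roughly like this sketch (constructor, input type, and assertion thresholds are illustrative assumptions, not the actual suite):

```rust
#[test]
fn isolated_point_gets_highest_lof_score() {
    // Four tightly clustered points plus one isolated point.
    // NOTE: `Matrix::from_rows` and `new()` are assumptions for illustration.
    let data = Matrix::from_rows(&[
        [0.0, 0.0],
        [0.1, 0.0],
        [0.0, 0.1],
        [0.1, 0.1],
        [10.0, 10.0], // clear outlier
    ]);

    let mut lof = LocalOutlierFactor::new().with_n_neighbors(3);
    lof.fit(&data);
    let scores = lof.score_samples(&data);

    // Higher = more anomalous: the isolated point should score highest,
    // while the cluster points should sit near 1.0.
    assert!(scores[4] > scores[0]);
    assert!(scores[4] > 1.5);
}
```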

GREEN Phase

Implemented complete LOF algorithm (352 lines):

Core Components:

  1. LocalOutlierFactor: Public API

    • Builder pattern (with_n_neighbors, with_contamination)
    • fit/predict/score_samples methods
    • negative_outlier_factor for sklearn compatibility
  2. k-NN Search (compute_knn):

    • Brute-force distance computation
    • Sort by distance
    • Extract k nearest neighbors
  3. Reachability Distance (reachability_distance):

    • max(distance(A,B), k-distance(B))
    • Smooths density estimation
  4. Local Reachability Density (compute_lrd):

    • LRD(A) = k / Σ(reachability_distance(A, neighbor))
    • Inverse of average reachability distance
  5. LOF Score (compute_lof_scores):

    • LOF(A) = avg(LRD(neighbors)) / LRD(A)
    • Ratio of neighbor density to point density
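
The numeric components above can be reconstructed as standalone Rust straight from their formulas. The following is an illustrative re-derivation over plain `Vec<Vec<f64>>` rows, not aprender's actual 352-line source; duplicate points (which zero an LRD denominator) are deliberately left unhandled:

```rust
/// Euclidean distance between two feature vectors.
fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Brute-force k-NN: for each point, the indices and distances of its
/// k nearest neighbors, sorted ascending by distance.
fn compute_knn(data: &[Vec<f64>], k: usize) -> Vec<Vec<(usize, f64)>> {
    (0..data.len())
        .map(|i| {
            let mut dists: Vec<(usize, f64)> = (0..data.len())
                .filter(|&j| j != i)
                .map(|j| (j, euclidean(&data[i], &data[j])))
                .collect();
            dists.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
            dists.truncate(k);
            dists
        })
        .collect()
}

/// reach_dist(A, B) = max(dist(A, B), k_dist(B)).
fn reachability_distance(dist_ab: f64, k_dist_b: f64) -> f64 {
    dist_ab.max(k_dist_b)
}

/// LRD(A) = |neighbors| / sum of reachability distances to A's neighbors.
fn compute_lrd(knn: &[Vec<(usize, f64)>]) -> Vec<f64> {
    knn.iter()
        .map(|neighbors| {
            let sum: f64 = neighbors
                .iter()
                .map(|&(j, d)| {
                    // k_dist of neighbor j = distance to j's k-th neighbor,
                    // i.e. the last entry of j's sorted k-NN list.
                    let k_dist_j = knn[j].last().unwrap().1;
                    reachability_distance(d, k_dist_j)
                })
                .sum();
            neighbors.len() as f64 / sum
        })
        .collect()
}

/// LOF(A) = mean LRD of A's neighbors divided by LRD(A).
fn compute_lof_scores(knn: &[Vec<(usize, f64)>], lrd: &[f64]) -> Vec<f64> {
    knn.iter()
        .zip(lrd)
        .map(|(neighbors, &lrd_a)| {
            let mean: f64 = neighbors.iter().map(|&(j, _)| lrd[j]).sum::<f64>()
                / neighbors.len() as f64;
            mean / lrd_a
        })
        .collect()
}
```

The ratio in the last function is the whole story: a point whose neighborhood is sparser than its neighbors' has a small LRD in the denominator, pushing LOF above 1.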

Key Algorithm Steps:

  1. Fit:

    • Compute k-NN for all training points
    • Compute LRD for all points
    • Compute LOF scores
    • Determine threshold from contamination (see the sketch after this list)
  2. Predict:

    • Compute k-NN for query points against training
    • Compute LRD for query points
    • Compute LOF scores for query points
    • Apply threshold: LOF > threshold → anomaly
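
The threshold step can be made concrete with a small standalone helper: take a high quantile of the training LOF scores so that roughly `contamination` of the points exceed it. This is one plausible rule consistent with the description above, not necessarily the exact rule in aprender:

```rust
/// Pick the anomaly threshold so that roughly `contamination` of the
/// training points fall above it. Assumes a non-empty score slice.
fn threshold_from_contamination(lof_scores: &[f64], contamination: f64) -> f64 {
    let mut sorted = lof_scores.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // Index of the (1 - contamination) quantile.
    let idx = ((1.0 - contamination) * (sorted.len() - 1) as f64).round() as usize;
    sorted[idx]
}

/// Predict step: LOF > threshold -> anomaly (-1), otherwise normal (1).
fn apply_threshold(lof_scores: &[f64], threshold: f64) -> Vec<i32> {
    lof_scores
        .iter()
        .map(|&s| if s > threshold { -1 } else { 1 })
        .collect()
}
```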

REFACTOR Phase

  • Removed unused variables
  • Zero clippy warnings
  • Exported in prelude
  • Comprehensive documentation
  • Varying density example showcasing LOF's key advantage

Final State:

  • Tests: 612 → 628 (+16)
  • Zero warnings
  • All quality gates passing

Algorithm Details

Local Outlier Factor:

  • Compares local density to neighbors' densities
  • Key advantage: Works with varying density regions
  • LOF score interpretation:
    • LOF ≈ 1: Similar density (normal)
    • LOF >> 1: Lower density (outlier)
    • LOF < 1: Higher density (core point)

Time Complexity: O(n² log n)

  • n = samples, k = n_neighbors
  • Dominated by the brute-force k-NN search: O(n²) pairwise distances plus a full sort of each point's distance list (a size-k heap would tighten the sort term to O(n² log k))

Space Complexity: O(n²)

  • Distance matrix and k-NN storage

Reachability Distance:

reach_dist(A, B) = max(dist(A, B), k_dist(B))

Where k_dist(B) is the distance from B to its k-th nearest neighbor. Taking the max smooths the density estimate: every point deep inside B's neighborhood gets the same reachability distance k_dist(B). For example, with k = 5, if dist(A, B) = 0.5 but B's 5th-nearest neighbor lies 1.2 away, then reach_dist(A, B) = 1.2.

Local Reachability Density:

LRD(A) = k / Σ_i reach_dist(A, neighbor_i)

LOF Score:

LOF(A) = (Σ_i LRD(neighbor_i)) / (k * LRD(A))
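
Worked example: if point A sits in a sparse region with LRD(A) = 0.2 while its k neighbors average LRD 0.8, then LOF(A) = 0.8 / 0.2 = 4, well above 1, so A is flagged as an outlier; a point whose neighbors share its density gets a ratio near 1.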

Parameters

  • n_neighbors (default: 20): Number of neighbors for density estimation

    • Smaller k: More local, sensitive to local outliers
    • Larger k: More global context, stable but may miss local anomalies
  • contamination (default: 0.1): Expected anomaly proportion

    • Range: 0.0 to 0.5
    • Sets classification threshold
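
A hedged sketch of how the two knobs combine through the builder (the `with_*` names come from the GREEN phase; `new()` is an assumption):

```rust
use aprender::prelude::*;

fn main() {
    // Illustrative only: `new()` is assumed; the `with_*` names come from
    // the GREEN phase summary above.
    let _local = LocalOutlierFactor::new()
        .with_n_neighbors(5)       // small k: sensitive to local outliers
        .with_contamination(0.10); // expect ~10% anomalies

    let _global = LocalOutlierFactor::new()
        .with_n_neighbors(50)      // large k: smoother, more global context
        .with_contamination(0.10);
}
```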

Example Highlights

The example demonstrates:

  1. Basic anomaly detection
  2. LOF score interpretation (≈1 vs >>1)
  3. Varying density clusters (LOF's key advantage; sketched after this list)
  4. n_neighbors parameter effects
  5. Contamination parameter
  6. LOF vs Isolation Forest comparison
  7. negative_outlier_factor for sklearn compatibility
  8. Reproducibility
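
Highlight 3 is worth making concrete. Reusing the standalone helpers from the GREEN-phase sketch earlier in this chapter, so nothing about aprender's API is assumed:

```rust
fn main() {
    // A tight cluster (spacing 0.1), a legitimate sparse cluster
    // (spacing 2.0), and one point that is an outlier only relative
    // to the tight cluster's density.
    let mut data: Vec<Vec<f64>> = Vec::new();
    for i in 0..10 {
        data.push(vec![i as f64 * 0.1, 0.0]); // dense cluster
    }
    for i in 0..10 {
        data.push(vec![i as f64 * 2.0, 50.0]); // sparse (but normal) cluster
    }
    data.push(vec![0.5, 1.0]); // 1.0 away from the dense cluster

    let knn = compute_knn(&data, 3);
    let lrd = compute_lrd(&knn);
    let lof = compute_lof_scores(&knn, &lrd);

    // A single global distance threshold would flag sparse-cluster points
    // before this one; LOF judges each point against its own neighborhood,
    // so only the last point scores well above 1.
    println!("outlier LOF = {:.2}", lof.last().unwrap());
}
```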

Key Takeaways

  1. Density-Based: LOF compares local densities, not global isolation
  2. Varying Density: Excels where clusters have different densities
  3. Interpretable Scores: LOF score has clear meaning
  4. Local Context: n_neighbors controls locality
  5. Complementary: Works well alongside Isolation Forest
  6. Relative Scores: Density ratios, not absolute distances, drive the score, so regions at different scales are judged on equal footing

Comparison: LOF vs Isolation Forest

| Feature | LOF | Isolation Forest |
|---|---|---|
| Approach | Density-based | Isolation-based |
| Varying Density | Excellent | Good |
| Global Outliers | Good | Excellent |
| Training Time | O(n²) | O(n log m) |
| Parameter Tuning | n_neighbors | n_estimators, max_samples |
| Interpretability | High (density ratio) | Medium (path length) |

When to use LOF:

  • Data has regions with different densities
  • Need to detect local outliers
  • Want interpretable density-based scores

When to use Isolation Forest:

  • Large datasets (faster training)
  • Global outliers more important
  • Don't know density structure

Best practice: Use both and ensemble the results!
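
One simple way to follow that advice is to combine the two 1/-1 label vectors. The conservative intersection below flags a point only when both detectors agree (a generic sketch; it assumes both models emit the 1 = normal / -1 = anomaly convention used in this chapter); taking the union instead trades precision for recall:

```rust
/// Conservative ensemble: flag a point only when both detectors agree
/// it is an anomaly (1 = normal, -1 = anomaly).
fn ensemble_and(lof_labels: &[i32], iforest_labels: &[i32]) -> Vec<i32> {
    lof_labels
        .iter()
        .zip(iforest_labels)
        .map(|(&a, &b)| if a == -1 && b == -1 { -1 } else { 1 })
        .collect()
}
```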

Use Cases

  1. Fraud Detection: Transactions with unusual patterns relative to user's history
  2. Network Security: Anomalous traffic in varying load conditions
  3. Manufacturing: Defects in varying production speeds
  4. Sensor Networks: Faulty sensors in varying environmental conditions
  5. Medical Diagnosis: Unusual patient metrics relative to demographic group

Testing Strategy

Unit Tests (16 implemented):

  • Correctness: Detects clear outliers in varying densities
  • API contracts: panic if predict/score is called before fit, return expected types
  • Parameters: n_neighbors, contamination effects
  • Edge cases: All normal, all anomalous, small k

Property-Based Tests (future work):

  • LOF ≈ 1 for uniform density
  • LOF monotonic in isolation degree
  • Consistency: Same k → consistent relative ordering
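
A sketch of the first property, assuming the proptest crate and reusing the standalone helpers from earlier in this chapter:

```rust
use proptest::prelude::*;

proptest! {
    #[test]
    fn lof_near_one_on_uniform_grid(spacing in 0.1f64..10.0) {
        // On a uniform 1-D grid every point shares its neighbors' density,
        // so all LOF scores should sit close to 1 regardless of scale.
        let data: Vec<Vec<f64>> = (0..20).map(|i| vec![i as f64 * spacing]).collect();
        let knn = compute_knn(&data, 3);
        let lrd = compute_lrd(&knn);
        for score in compute_lof_scores(&knn, &lrd) {
            prop_assert!((score - 1.0).abs() < 0.5);
        }
    }
}
```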