Case Study: Local Outlier Factor (LOF) Implementation

This chapter documents the complete EXTREME TDD implementation of aprender's Local Outlier Factor (LOF) algorithm for density-based anomaly detection, as specified in GitHub Issue #20.

Background

GitHub Issue #20: Implement Local Outlier Factor (LOF) for Anomaly Detection

Requirements:

  • Density-based anomaly detection using local reachability density
  • Detects outliers in varying density regions
  • Parameters: n_neighbors, contamination
  • Methods: fit(), predict(), score_samples(), negative_outlier_factor()
  • LOF score interpretation: ≈1 = normal, >>1 = outlier
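
Taken together, the requirements imply roughly this calling sequence. A minimal sketch, assuming a constructor `new()` and a `Matrix::from_rows` input helper (only the method and builder names come from the issue):

```rust
use aprender::prelude::*; // the REFACTOR phase exports LocalOutlierFactor here

fn main() {
    // NOTE: `Matrix::from_rows`, `new()`, and the return types shown are
    // illustrative assumptions; only the method names come from the issue.
    let data = Matrix::from_rows(&[
        [0.0, 0.0],
        [0.1, 0.1],
        [0.0, 0.1],
        [5.0, 5.0], // isolated point
    ]);

    let mut lof = LocalOutlierFactor::new()
        .with_n_neighbors(2)
        .with_contamination(0.25);
    lof.fit(&data);

    let labels = lof.predict(&data);         // 1 = normal, -1 = anomaly
    let scores = lof.score_samples(&data);   // ≈1 = normal, >>1 = outlier
    let nof = lof.negative_outlier_factor(); // sklearn-compatible sign
    println!("{labels:?} {scores:?} {nof:?}");
}
```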

Initial State:

  • Tests: 612 passing
  • Existing anomaly detection: Isolation Forest
  • No density-based anomaly detection

Implementation Summary

RED Phase

Created 16 comprehensive tests covering:

  • Constructor and basic fitting
  • LOF score calculation (higher = more anomalous)
  • Anomaly prediction (1=normal, -1=anomaly)
  • Contamination parameter (10%, 20%, 30%)
  • n_neighbors parameter (local vs global context)
  • Varying density clusters (key LOF advantage)
  • negative_outlier_factor() for sklearn compatibility
  • Error handling (predict/score before fit)
  • Multidimensional data
  • Edge cases (all normal points)
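
For flavor, the score-interpretation test might have looked roughly like this sketch (constructor, input type, and assertion thresholds are illustrative assumptions, not the actual suite):

```rust
#[test]
fn isolated_point_gets_highest_lof_score() {
    // Four tightly clustered points plus one isolated point.
    // NOTE: `Matrix::from_rows` and `new()` are assumptions for illustration.
    let data = Matrix::from_rows(&[
        [0.0, 0.0],
        [0.1, 0.0],
        [0.0, 0.1],
        [0.1, 0.1],
        [10.0, 10.0], // clear outlier
    ]);

    let mut lof = LocalOutlierFactor::new().with_n_neighbors(3);
    lof.fit(&data);
    let scores = lof.score_samples(&data);

    // Higher = more anomalous: the isolated point should score highest,
    // while the cluster points should sit near 1.0.
    assert!(scores[4] > scores[0]);
    assert!(scores[4] > 1.5);
}
```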

GREEN Phase

Implemented complete LOF algorithm (352 lines):

Core Components:

  1. LocalOutlierFactor: Public API

    • Builder pattern (with_n_neighbors, with_contamination)
    • fit/predict/score_samples methods
    • negative_outlier_factor for sklearn compatibility
  2. k-NN Search (compute_knn):

    • Brute-force distance computation
    • Sort by distance
    • Extract k nearest neighbors
  3. Reachability Distance (reachability_distance):

    • max(distance(A,B), k-distance(B))
    • Smooths density estimation
  4. Local Reachability Density (compute_lrd):

    • LRD(A) = k / Σ(reachability_distance(A, neighbor))
    • Inverse of average reachability distance
  5. LOF Score (compute_lof_scores):

    • LOF(A) = avg(LRD(neighbors)) / LRD(A)
    • Ratio of neighbor density to point density
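
The numeric components above can be reconstructed as standalone Rust straight from their formulas. The following is an illustrative re-derivation over plain `Vec<Vec<f64>>` rows, not aprender's actual 352-line source; duplicate points (which zero an LRD denominator) are deliberately left unhandled:

```rust
/// Euclidean distance between two feature vectors.
fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Brute-force k-NN: for each point, the indices and distances of its
/// k nearest neighbors, sorted ascending by distance.
fn compute_knn(data: &[Vec<f64>], k: usize) -> Vec<Vec<(usize, f64)>> {
    (0..data.len())
        .map(|i| {
            let mut dists: Vec<(usize, f64)> = (0..data.len())
                .filter(|&j| j != i)
                .map(|j| (j, euclidean(&data[i], &data[j])))
                .collect();
            dists.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
            dists.truncate(k);
            dists
        })
        .collect()
}

/// reach_dist(A, B) = max(dist(A, B), k_dist(B)).
fn reachability_distance(dist_ab: f64, k_dist_b: f64) -> f64 {
    dist_ab.max(k_dist_b)
}

/// LRD(A) = |neighbors| / sum of reachability distances to A's neighbors.
fn compute_lrd(knn: &[Vec<(usize, f64)>]) -> Vec<f64> {
    knn.iter()
        .map(|neighbors| {
            let sum: f64 = neighbors
                .iter()
                .map(|&(j, d)| {
                    // k_dist of neighbor j = distance to j's k-th neighbor,
                    // i.e. the last entry of j's sorted k-NN list.
                    let k_dist_j = knn[j].last().unwrap().1;
                    reachability_distance(d, k_dist_j)
                })
                .sum();
            neighbors.len() as f64 / sum
        })
        .collect()
}

/// LOF(A) = mean LRD of A's neighbors divided by LRD(A).
fn compute_lof_scores(knn: &[Vec<(usize, f64)>], lrd: &[f64]) -> Vec<f64> {
    knn.iter()
        .zip(lrd)
        .map(|(neighbors, &lrd_a)| {
            let mean: f64 = neighbors.iter().map(|&(j, _)| lrd[j]).sum::<f64>()
                / neighbors.len() as f64;
            mean / lrd_a
        })
        .collect()
}
```

The ratio in the last function is the whole story: a point whose neighborhood is sparser than its neighbors' has a small LRD in the denominator, pushing LOF above 1.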

Key Algorithm Steps:

  1. Fit:

    • Compute k-NN for all training points
    • Compute LRD for all points
    • Compute LOF scores
    • Determine threshold from contamination (see the sketch after this list)
  2. Predict:

    • Compute k-NN for query points against training
    • Compute LRD for query points
    • Compute LOF scores for query points
    • Apply threshold: LOF > threshold → anomaly
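
The threshold step can be made concrete with a small standalone helper: take a high quantile of the training LOF scores so that roughly `contamination` of the points exceed it. This is one plausible rule consistent with the description above, not necessarily the exact rule in aprender:

```rust
/// Pick the anomaly threshold so that roughly `contamination` of the
/// training points fall above it. Assumes a non-empty score slice.
fn threshold_from_contamination(lof_scores: &[f64], contamination: f64) -> f64 {
    let mut sorted = lof_scores.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // Index of the (1 - contamination) quantile.
    let idx = ((1.0 - contamination) * (sorted.len() - 1) as f64).round() as usize;
    sorted[idx]
}

/// Predict step: LOF > threshold -> anomaly (-1), otherwise normal (1).
fn apply_threshold(lof_scores: &[f64], threshold: f64) -> Vec<i32> {
    lof_scores
        .iter()
        .map(|&s| if s > threshold { -1 } else { 1 })
        .collect()
}
```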

REFACTOR Phase

  • Removed unused variables
  • Zero clippy warnings
  • Exported in prelude
  • Comprehensive documentation
  • Varying density example showcasing LOF's key advantage

Final State:

  • Tests: 612 → 628 (+16)
  • Zero warnings
  • All quality gates passing

Algorithm Details

Local Outlier Factor:

  • Compares local density to neighbors' densities
  • Key advantage: Works with varying density regions
  • LOF score interpretation:
    • LOF ≈ 1: Similar density (normal)
    • LOF >> 1: Lower density (outlier)
    • LOF < 1: Higher density (core point)

Time Complexity: O(n² log n)

  • n = samples, k = n_neighbors
  • Dominated by the brute-force k-NN search: O(n²) pairwise distances plus a full sort of each point's distance list (a size-k heap would tighten the sort term to O(n² log k))

Space Complexity: O(n²)

  • Distance matrix and k-NN storage

Reachability Distance:

reach_dist(A, B) = max(dist(A, B), k_dist(B))

Where k_dist(B) is the distance from B to its k-th nearest neighbor. Taking the max smooths the density estimate: every point deep inside B's neighborhood gets the same reachability distance k_dist(B). For example, with k = 5, if dist(A, B) = 0.5 but B's 5th-nearest neighbor lies 1.2 away, then reach_dist(A, B) = 1.2.

Local Reachability Density:

LRD(A) = k / Σ_i reach_dist(A, neighbor_i)

LOF Score:

LOF(A) = (Σ_i LRD(neighbor_i)) / (k * LRD(A))
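
Worked example: if point A sits in a sparse region with LRD(A) = 0.2 while its k neighbors average LRD 0.8, then LOF(A) = 0.8 / 0.2 = 4, well above 1, so A is flagged as an outlier; a point whose neighbors share its density gets a ratio near 1.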

Parameters

  • n_neighbors (default: 20): Number of neighbors for density estimation

    • Smaller k: More local, sensitive to local outliers
    • Larger k: More global context, stable but may miss local anomalies
  • contamination (default: 0.1): Expected anomaly proportion

    • Range: 0.0 to 0.5
    • Sets classification threshold
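
A hedged sketch of how the two knobs combine through the builder (the `with_*` names come from the GREEN phase; `new()` is an assumption):

```rust
use aprender::prelude::*;

fn main() {
    // Illustrative only: `new()` is assumed; the `with_*` names come from
    // the GREEN phase summary above.
    let _local = LocalOutlierFactor::new()
        .with_n_neighbors(5)       // small k: sensitive to local outliers
        .with_contamination(0.10); // expect ~10% anomalies

    let _global = LocalOutlierFactor::new()
        .with_n_neighbors(50)      // large k: smoother, more global context
        .with_contamination(0.10);
}
```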

Example Highlights

The example demonstrates:

  1. Basic anomaly detection
  2. LOF score interpretation (≈1 vs >>1)
  3. Varying density clusters (LOF's key advantage; sketched after this list)
  4. n_neighbors parameter effects
  5. Contamination parameter
  6. LOF vs Isolation Forest comparison
  7. negative_outlier_factor for sklearn compatibility
  8. Reproducibility
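
Highlight 3 is worth making concrete. Reusing the standalone helpers from the GREEN-phase sketch earlier in this chapter, so nothing about aprender's API is assumed:

```rust
fn main() {
    // A tight cluster (spacing 0.1), a legitimate sparse cluster
    // (spacing 2.0), and one point that is an outlier only relative
    // to the tight cluster's density.
    let mut data: Vec<Vec<f64>> = Vec::new();
    for i in 0..10 {
        data.push(vec![i as f64 * 0.1, 0.0]); // dense cluster
    }
    for i in 0..10 {
        data.push(vec![i as f64 * 2.0, 50.0]); // sparse (but normal) cluster
    }
    data.push(vec![0.5, 1.0]); // 1.0 away from the dense cluster

    let knn = compute_knn(&data, 3);
    let lrd = compute_lrd(&knn);
    let lof = compute_lof_scores(&knn, &lrd);

    // A single global distance threshold would flag sparse-cluster points
    // before this one; LOF judges each point against its own neighborhood,
    // so only the last point scores well above 1.
    println!("outlier LOF = {:.2}", lof.last().unwrap());
}
```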

Key Takeaways

  1. Density-Based: LOF compares local densities, not global isolation
  2. Varying Density: Excels where clusters have different densities
  3. Interpretable Scores: LOF score has clear meaning
  4. Local Context: n_neighbors controls locality
  5. Complementary: Works well alongside Isolation Forest
  6. Relative Scores: Density ratios, not absolute distances, drive the score, so regions at different scales are judged on equal footing

Comparison: LOF vs Isolation Forest

| Feature | LOF | Isolation Forest |
|---|---|---|
| Approach | Density-based | Isolation-based |
| Varying Density | Excellent | Good |
| Global Outliers | Good | Excellent |
| Training Time | O(n²) | O(n log m) |
| Parameter Tuning | n_neighbors | n_estimators, max_samples |
| Interpretability | High (density ratio) | Medium (path length) |

When to use LOF:

  • Data has regions with different densities
  • Need to detect local outliers
  • Want interpretable density-based scores

When to use Isolation Forest:

  • Large datasets (faster training)
  • Global outliers more important
  • Don't know density structure

Best practice: Use both and ensemble the results!
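
One simple way to follow that advice is to combine the two 1/-1 label vectors. The conservative intersection below flags a point only when both detectors agree (a generic sketch; it assumes both models emit the 1 = normal / -1 = anomaly convention used in this chapter); taking the union instead trades precision for recall:

```rust
/// Conservative ensemble: flag a point only when both detectors agree
/// it is an anomaly (1 = normal, -1 = anomaly).
fn ensemble_and(lof_labels: &[i32], iforest_labels: &[i32]) -> Vec<i32> {
    lof_labels
        .iter()
        .zip(iforest_labels)
        .map(|(&a, &b)| if a == -1 && b == -1 { -1 } else { 1 })
        .collect()
}
```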

Use Cases

  1. Fraud Detection: Transactions with unusual patterns relative to user's history
  2. Network Security: Anomalous traffic in varying load conditions
  3. Manufacturing: Defects in varying production speeds
  4. Sensor Networks: Faulty sensors in varying environmental conditions
  5. Medical Diagnosis: Unusual patient metrics relative to demographic group

Testing Strategy

Unit Tests (16 implemented):

  • Correctness: Detects clear outliers in varying densities
  • API contracts: panic if predict/score is called before fit, return expected types
  • Parameters: n_neighbors, contamination effects
  • Edge cases: All normal, all anomalous, small k

Property-Based Tests (future work):

  • LOF ≈ 1 for uniform density
  • LOF monotonic in isolation degree
  • Consistency: Same k → consistent relative ordering
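
A sketch of the first property, assuming the proptest crate and reusing the standalone helpers from earlier in this chapter:

```rust
use proptest::prelude::*;

proptest! {
    #[test]
    fn lof_near_one_on_uniform_grid(spacing in 0.1f64..10.0) {
        // On a uniform 1-D grid every point shares its neighbors' density,
        // so all LOF scores should sit close to 1 regardless of scale.
        let data: Vec<Vec<f64>> = (0..20).map(|i| vec![i as f64 * spacing]).collect();
        let knn = compute_knn(&data, 3);
        let lrd = compute_lrd(&knn);
        for score in compute_lof_scores(&knn, &lrd) {
            prop_assert!((score - 1.0).abs() < 0.5);
        }
    }
}
```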