Case Study: Isolation Forest Implementation
This chapter documents the complete EXTREME TDD implementation of aprender's Isolation Forest algorithm for anomaly detection, as requested in Issue #17.
Background
GitHub Issue #17: Implement Isolation Forest for Anomaly Detection
Requirements:
- Ensemble of isolation trees using random partitioning
- O(n log n) training complexity
- Parameters: n_estimators, max_samples, contamination
- Methods: fit(), predict(), score_samples()
- predict() returns 1 for normal, -1 for anomaly
- score_samples() returns anomaly scores (lower = more anomalous)
- Use cases: fraud detection, network intrusion, quality control
Initial State:
- Tests: 596 passing
- Existing clustering: K-Means, DBSCAN, Hierarchical, GMM
- No anomaly detection support
Implementation Summary
RED Phase
Created 17 comprehensive tests (one representative test is sketched after this list) covering:
- Constructor and basic fitting
- Anomaly prediction (1=normal, -1=anomaly)
- Anomaly score computation
- Contamination parameter (10%, 20%, 30%)
- Number of trees (ensemble size)
- Max samples (subsample size)
- Reproducibility with random seeds
- Multidimensional data (3+ features)
- Path length calculations
- Decision function consistency
- Error handling (predict/score before fit)
- Edge cases (all normal points)
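For flavor, one of these tests might look roughly like the sketch below. The constructor, the `with_*` builder calls, and the plain `Vec<Vec<f64>>` input are assumptions inferred from the issue's described API rather than aprender's confirmed signatures.

```rust
use aprender::prelude::*;

#[test]
fn test_detects_clear_outliers() {
    // Eight tightly clustered normal points plus two obvious outliers.
    let data = vec![
        vec![1.0, 1.0], vec![1.1, 0.9], vec![0.9, 1.1], vec![1.0, 1.2],
        vec![1.2, 1.0], vec![0.8, 0.9], vec![1.1, 1.1], vec![0.9, 0.8],
        vec![8.0, 8.0], vec![-6.0, 7.5],
    ];

    // Hypothetical builder calls; names follow the issue's with_* convention.
    let mut forest = IsolationForest::new()
        .with_contamination(0.2)
        .with_random_state(42);
    forest.fit(&data);

    let labels = forest.predict(&data);
    assert_eq!(labels[8], -1); // first outlier flagged
    assert_eq!(labels[9], -1); // second outlier flagged
    assert!(labels[..8].iter().all(|&l| l == 1)); // normal points pass
}
```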
GREEN Phase
Implemented complete Isolation Forest (387 lines):
Core Components (sketched in code after this list):
- IsolationNode: Binary tree node structure
- Split feature and value
- Left/right children (Box for recursion)
- Node size (for path length calculation)
- IsolationTree: Single isolation tree
- build_tree(): Recursive random partitioning
- path_length(): Compute isolation path length
- c(n): Average BST path length for normalization
- IsolationForest: Public API
- Ensemble of isolation trees
- Builder pattern (with_* methods)
- fit(): Train ensemble on subsamples
- predict(): Binary classification (1/-1)
- score_samples(): Anomaly scores
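As a rough picture of that layout, the following Rust sketch shows how the three components could fit together; the field names and types here are illustrative assumptions, not aprender's verified definitions.

```rust
/// Binary tree node: a random split plus the number of samples that reached it.
struct IsolationNode {
    feature: usize,                   // split feature index
    value: f64,                       // random split value
    left: Option<Box<IsolationNode>>, // Box makes the recursive type possible
    right: Option<Box<IsolationNode>>,
    size: usize,                      // node size, used in path length estimates
}

/// A single isolation tree built from one random subsample.
struct IsolationTree {
    root: IsolationNode,
    max_depth: usize,
}

/// Public API: the ensemble plus its hyperparameters.
struct IsolationForest {
    trees: Vec<IsolationTree>,
    n_estimators: usize,
    max_samples: usize,
    contamination: f64,
    threshold: f64, // set during fit() from the contamination quantile
}
```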
Key Algorithm Steps (a standalone code sketch follows this list):
- Training (fit):
- For each of N trees:
- Sample random subset (max_samples)
- Build tree via random splits
- Store tree in ensemble
- Tree Building (build_tree):
- Terminal: depth >= max_depth OR n_samples <= 1
- Pick random feature
- Pick random split value between min/max
- Recursively build left/right subtrees
- Scoring (score_samples):
- For each sample:
- Compute path length in each tree
- Average across ensemble
- Normalize: 2^(-avg_path / c_norm)
- Invert (lower = more anomalous)
- Classification (predict):
- Compute anomaly scores
- Compare to threshold (from contamination)
- Return 1 (normal) or -1 (anomaly)
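The four steps can be condensed into a small standalone Rust sketch. It is illustrative rather than aprender's actual code: the `Node` enum, function names, and the use of the `rand` crate are assumptions made for this example.

```rust
use rand::Rng;

// One tree node: either a leaf that remembers how many samples reached it,
// or an internal split on a random feature/value.
enum Node {
    Leaf { size: usize },
    Split { feature: usize, value: f64, left: Box<Node>, right: Box<Node> },
}

// c(n): average unsuccessful-search path length in a BST of n points,
// used to normalize path lengths (see Algorithm Details below).
fn c(n: usize) -> f64 {
    if n <= 1 {
        return 0.0;
    }
    let n = n as f64;
    2.0 * ((n - 1.0).ln() + 0.5772) - 2.0 * (n - 1.0) / n
}

// Tree building: stop at max_depth or a single sample, otherwise split on a
// random feature at a random value between its min and max.
fn build_tree(data: &[Vec<f64>], depth: usize, max_depth: usize, rng: &mut impl Rng) -> Node {
    if depth >= max_depth || data.len() <= 1 {
        return Node::Leaf { size: data.len() };
    }
    let feature = rng.gen_range(0..data[0].len());
    let (lo, hi) = data.iter().fold((f64::INFINITY, f64::NEG_INFINITY), |(lo, hi), row| {
        (lo.min(row[feature]), hi.max(row[feature]))
    });
    if hi <= lo {
        return Node::Leaf { size: data.len() };
    }
    let value = rng.gen_range(lo..hi);
    let (left, right): (Vec<Vec<f64>>, Vec<Vec<f64>>) =
        data.iter().cloned().partition(|row| row[feature] < value);
    Node::Split {
        feature,
        value,
        left: Box::new(build_tree(&left, depth + 1, max_depth, rng)),
        right: Box::new(build_tree(&right, depth + 1, max_depth, rng)),
    }
}

// Path length of a sample in one tree; early-terminated leaves are credited
// with c(size), the expected depth of the subtree they replaced.
fn path_length(node: &Node, x: &[f64], depth: f64) -> f64 {
    match node {
        Node::Leaf { size } => depth + c(*size),
        Node::Split { feature, value, left, right } => {
            let child = if x[*feature] < *value { left } else { right };
            path_length(child, x, depth + 1.0)
        }
    }
}

// Training: each tree is built on its own random subsample of max_samples rows.
fn fit(data: &[Vec<f64>], n_estimators: usize, max_samples: usize, rng: &mut impl Rng) -> Vec<Node> {
    let m = max_samples.min(data.len());
    let max_depth = (m as f64).log2().ceil() as usize;
    (0..n_estimators)
        .map(|_| {
            let sample: Vec<Vec<f64>> =
                (0..m).map(|_| data[rng.gen_range(0..data.len())].clone()).collect();
            build_tree(&sample, 0, max_depth, rng)
        })
        .collect()
}

// Scoring: average the path lengths, normalize to s = 2^(-avg / c(m)), then
// negate so that lower scores mean "more anomalous".
fn score_sample(trees: &[Node], x: &[f64], max_samples: usize) -> f64 {
    let avg: f64 = trees.iter().map(|t| path_length(t, x, 0.0)).sum::<f64>() / trees.len() as f64;
    -(2f64.powf(-avg / c(max_samples)))
}

// Classification: scores below the contamination-derived threshold are anomalies.
fn predict_sample(trees: &[Node], x: &[f64], max_samples: usize, threshold: f64) -> i32 {
    if score_sample(trees, x, max_samples) < threshold { -1 } else { 1 }
}
```

Note the `c(size)` credit at early-terminated leaves: a point that stops at `max_depth` is charged the expected remaining depth of the subtree that was never built, which keeps the depth cap from biasing the scores.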
Numerical Considerations:
- Random subsampling for efficiency
- Path length normalization via c(n) function
- Threshold computed from training data quantile
- Default max_samples: min(256, n_samples)
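The threshold step can be sketched in a few standalone lines (assuming scores where lower means more anomalous):

```rust
// Classification threshold from training scores: the contamination-quantile,
// so that roughly `contamination * n` training points score below it and are
// labeled anomalous by predict().
fn threshold_from_contamination(mut scores: Vec<f64>, contamination: f64) -> f64 {
    scores.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let k = (scores.len() as f64 * contamination).floor() as usize;
    scores[k.min(scores.len() - 1)]
}
```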
REFACTOR Phase
- Removed unused imports
- Zero clippy warnings
- Exported in prelude for easy access
- Comprehensive documentation with examples
- Added fraud detection example scenario
Final State:
- Tests: 613 passing (596 → 613, +17)
- Zero warnings
- All quality gates passing
Algorithm Details
Isolation Forest:
- Ensemble method for anomaly detection
- Intuition: Anomalies are easier to isolate than normal points
- Shorter path length → More anomalous
Time Complexity: O(n log m)
- n = samples, m = max_samples
Space Complexity: O(t * m * d)
- t = n_estimators, m = max_samples, d = features
Average Path Length (c function):
c(n) = 2H(n-1) - 2(n-1)/n
where H(i) is the i-th harmonic number, approximately ln(i) + 0.5772 (Euler's constant)
This normalizes path lengths by the expected depth of an unsuccessful search in a binary search tree of n points.
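For example, with the default subsample size of 256: H(255) ≈ ln(255) + 0.5772 ≈ 6.12, so c(256) ≈ 2(6.12) - 2(255)/256 ≈ 10.2. A sample whose average path length across the ensemble equals c(256) scores exactly 2^(-1) = 0.5; noticeably shorter paths push the score toward 1 (anomalous) before the final inversion.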
Parameters
- n_estimators (default: 100): Number of trees in ensemble
- More trees = more stable predictions
- Diminishing returns after ~100 trees
- max_samples (default: min(256, n)): Subsample size per tree
- Smaller = faster training
- 256 is an empirically good default
- Full sample rarely needed
- contamination (default: 0.1): Expected anomaly proportion
- Range: 0.0 to 0.5
- Sets classification threshold
- 0.1 = 10% anomalies expected
- random_state (optional): Seed for reproducibility
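Putting the parameters together, configuring the estimator might look like the snippet below; the `with_*` method names follow the builder convention mentioned in the GREEN phase and should be read as assumptions.

```rust
use aprender::prelude::*;

// All four parameters set explicitly; omit any of them to keep the defaults
// listed above. The with_* method names are assumptions from the issue text.
let forest = IsolationForest::new()
    .with_n_estimators(100)   // ensemble size
    .with_max_samples(256)    // subsample per tree
    .with_contamination(0.1)  // expected anomaly fraction
    .with_random_state(42);   // reproducibility
```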
Example Highlights
The example demonstrates:
- Basic anomaly detection (8 normal + 2 outliers)
- Anomaly score interpretation
- Contamination parameter effects (10%, 20%, 30%)
- Ensemble size comparison (10 vs 100 trees)
- Credit card fraud detection scenario
- Reproducibility with random seeds
- Isolation path length concept
- Max samples parameter
Key Takeaways
- Unsupervised Anomaly Detection: No labeled data required
- Fast Training: O(n log m) makes it scalable
- Interpretable Scores: Path length has clear meaning
- Few Parameters: Easy to use with sensible defaults
- No Distance Metric: Works with any feature types
- Handles High Dimensions: Better than density-based methods
- Ensemble Benefits: Averaging reduces variance
Comparison with Other Methods
vs K-Means:
- K-Means: Finds clusters, requires distance threshold for anomalies
- Isolation Forest: Detects anomalies directly; no manual distance threshold to tune
vs DBSCAN:
- DBSCAN: Density-based, requires eps/min_samples tuning
- Isolation Forest: Contamination parameter is intuitive
vs GMM:
- GMM: Probabilistic, assumes Gaussian distributions
- Isolation Forest: No distributional assumptions
vs One-Class SVM:
- SVM: O(n²) to O(n³) training time
- Isolation Forest: O(n log m) - much faster
Use Cases
- Fraud Detection: Credit card transactions, insurance claims
- Network Security: Intrusion detection, anomalous traffic
- Quality Control: Manufacturing defects, sensor anomalies
- System Monitoring: Server metrics, application logs
- Healthcare: Rare disease detection, unusual patient profiles
Testing Strategy
Property-Based Tests (future work; one property is sketched after this list):
- Score ranges: All scores should be finite
- Contamination consistency: Higher contamination → more anomalies
- Reproducibility: Same seed → same results
- Path length bounds: 0 ≤ path ≤ log2(max_samples)
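As a sketch of how one of these properties could be expressed with the proptest crate (the IsolationForest calls reuse the assumed API from the earlier examples):

```rust
use aprender::prelude::*;
use proptest::prelude::*;

proptest! {
    // Reproducibility: the same seed on the same data must give identical scores.
    #[test]
    fn same_seed_same_scores(seed in 0u64..1_000, n in 5usize..50) {
        // Deterministic synthetic data so the property only exercises the seed.
        let data: Vec<Vec<f64>> = (0..n)
            .map(|i| vec![i as f64, (i * i % 7) as f64])
            .collect();

        // Hypothetical API: new(), with_random_state(), fit(), score_samples().
        let scores = |s: u64| {
            let mut forest = IsolationForest::new().with_random_state(s);
            forest.fit(&data);
            forest.score_samples(&data)
        };

        prop_assert_eq!(scores(seed), scores(seed));
    }
}
```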
Unit Tests (17 implemented):
- Correctness: Detects clear outliers
- API contracts: Panic before fit, return expected types
- Parameters: All builder methods work
- Edge cases: All normal, all anomalous, small datasets