Case Study: Isolation Forest Implementation
This chapter documents the complete EXTREME TDD implementation of aprender's Isolation Forest algorithm for anomaly detection, as requested in Issue #17.
Background
GitHub Issue #17: Implement Isolation Forest for Anomaly Detection
Requirements:
- Ensemble of isolation trees using random partitioning
- O(n log n) training complexity
- Parameters: n_estimators, max_samples, contamination
- Methods: fit(), predict(), score_samples()
- predict() returns 1 for normal, -1 for anomaly
- score_samples() returns anomaly scores (lower = more anomalous)
- Use cases: fraud detection, network intrusion, quality control
Initial State:
- Tests: 596 passing
- Existing clustering: K-Means, DBSCAN, Hierarchical, GMM
- No anomaly detection support
Implementation Summary
RED Phase
Created 17 comprehensive tests (one representative test is sketched after this list) covering:
- Constructor and basic fitting
- Anomaly prediction (1=normal, -1=anomaly)
- Anomaly score computation
- Contamination parameter (10%, 20%, 30%)
- Number of trees (ensemble size)
- Max samples (subsample size)
- Reproducibility with random seeds
- Multidimensional data (3+ features)
- Path length calculations
- Decision function consistency
- Error handling (predict/score before fit)
- Edge cases (all normal points)
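For flavor, one of these tests might look roughly like the sketch below. The constructor, the `with_*` builder calls, and the plain `Vec<Vec<f64>>` input are assumptions inferred from the issue's described API rather than aprender's confirmed signatures.

```rust
use aprender::prelude::*;

#[test]
fn test_detects_clear_outliers() {
    // Eight tightly clustered normal points plus two obvious outliers.
    let data = vec![
        vec![1.0, 1.0], vec![1.1, 0.9], vec![0.9, 1.1], vec![1.0, 1.2],
        vec![1.2, 1.0], vec![0.8, 0.9], vec![1.1, 1.1], vec![0.9, 0.8],
        vec![8.0, 8.0], vec![-6.0, 7.5],
    ];

    // Hypothetical builder calls; names follow the issue's with_* convention.
    let mut forest = IsolationForest::new()
        .with_contamination(0.2)
        .with_random_state(42);
    forest.fit(&data);

    let labels = forest.predict(&data);
    assert_eq!(labels[8], -1); // first outlier flagged
    assert_eq!(labels[9], -1); // second outlier flagged
    assert!(labels[..8].iter().all(|&l| l == 1)); // normal points pass
}
```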
GREEN Phase
Implemented complete Isolation Forest (387 lines):
Core Components (sketched in code after this list):
- IsolationNode: Binary tree node structure
- Split feature and value
- Left/right children (Box for recursion)
- Node size (for path length calculation)
- IsolationTree: Single isolation tree
- build_tree(): Recursive random partitioning
- path_length(): Compute isolation path length
- c(n): Average BST path length for normalization
- IsolationForest: Public API
- Ensemble of isolation trees
- Builder pattern (with_* methods)
- fit(): Train ensemble on subsamples
- predict(): Binary classification (1/-1)
- score_samples(): Anomaly scores
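As a rough picture of that layout, the following Rust sketch shows how the three components could fit together; the field names and types here are illustrative assumptions, not aprender's verified definitions.

```rust
/// Binary tree node: a random split plus the number of samples that reached it.
struct IsolationNode {
    feature: usize,                   // split feature index
    value: f64,                       // random split value
    left: Option<Box<IsolationNode>>, // Box makes the recursive type possible
    right: Option<Box<IsolationNode>>,
    size: usize,                      // node size, used in path length estimates
}

/// A single isolation tree built from one random subsample.
struct IsolationTree {
    root: IsolationNode,
    max_depth: usize,
}

/// Public API: the ensemble plus its hyperparameters.
struct IsolationForest {
    trees: Vec<IsolationTree>,
    n_estimators: usize,
    max_samples: usize,
    contamination: f64,
    threshold: f64, // set during fit() from the contamination quantile
}
```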
Key Algorithm Steps (a standalone code sketch follows this list):
- Training (fit):
- For each of N trees:
- Sample random subset (max_samples)
- Build tree via random splits
- Store tree in ensemble
- Tree Building (build_tree):
- Terminal: depth >= max_depth OR n_samples <= 1
- Pick random feature
- Pick random split value between min/max
- Recursively build left/right subtrees
- Scoring (score_samples):
- For each sample:
- Compute path length in each tree
- Average across ensemble
- Normalize: 2^(-avg_path / c_norm)
- Invert (lower = more anomalous)
- Classification (predict):
- Compute anomaly scores
- Compare to threshold (from contamination)
- Return 1 (normal) or -1 (anomaly)
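The four steps can be condensed into a small standalone Rust sketch. It is illustrative rather than aprender's actual code: the `Node` enum, function names, and the use of the `rand` crate are assumptions made for this example.

```rust
use rand::Rng;

// One tree node: either a leaf that remembers how many samples reached it,
// or an internal split on a random feature/value.
enum Node {
    Leaf { size: usize },
    Split { feature: usize, value: f64, left: Box<Node>, right: Box<Node> },
}

// c(n): average unsuccessful-search path length in a BST of n points,
// used to normalize path lengths (see Algorithm Details below).
fn c(n: usize) -> f64 {
    if n <= 1 {
        return 0.0;
    }
    let n = n as f64;
    2.0 * ((n - 1.0).ln() + 0.5772) - 2.0 * (n - 1.0) / n
}

// Tree building: stop at max_depth or a single sample, otherwise split on a
// random feature at a random value between its min and max.
fn build_tree(data: &[Vec<f64>], depth: usize, max_depth: usize, rng: &mut impl Rng) -> Node {
    if depth >= max_depth || data.len() <= 1 {
        return Node::Leaf { size: data.len() };
    }
    let feature = rng.gen_range(0..data[0].len());
    let (lo, hi) = data.iter().fold((f64::INFINITY, f64::NEG_INFINITY), |(lo, hi), row| {
        (lo.min(row[feature]), hi.max(row[feature]))
    });
    if hi <= lo {
        return Node::Leaf { size: data.len() };
    }
    let value = rng.gen_range(lo..hi);
    let (left, right): (Vec<Vec<f64>>, Vec<Vec<f64>>) =
        data.iter().cloned().partition(|row| row[feature] < value);
    Node::Split {
        feature,
        value,
        left: Box::new(build_tree(&left, depth + 1, max_depth, rng)),
        right: Box::new(build_tree(&right, depth + 1, max_depth, rng)),
    }
}

// Path length of a sample in one tree; early-terminated leaves are credited
// with c(size), the expected depth of the subtree they replaced.
fn path_length(node: &Node, x: &[f64], depth: f64) -> f64 {
    match node {
        Node::Leaf { size } => depth + c(*size),
        Node::Split { feature, value, left, right } => {
            let child = if x[*feature] < *value { left } else { right };
            path_length(child, x, depth + 1.0)
        }
    }
}

// Training: each tree is built on its own random subsample of max_samples rows.
fn fit(data: &[Vec<f64>], n_estimators: usize, max_samples: usize, rng: &mut impl Rng) -> Vec<Node> {
    let m = max_samples.min(data.len());
    let max_depth = (m as f64).log2().ceil() as usize;
    (0..n_estimators)
        .map(|_| {
            let sample: Vec<Vec<f64>> =
                (0..m).map(|_| data[rng.gen_range(0..data.len())].clone()).collect();
            build_tree(&sample, 0, max_depth, rng)
        })
        .collect()
}

// Scoring: average the path lengths, normalize to s = 2^(-avg / c(m)), then
// negate so that lower scores mean "more anomalous".
fn score_sample(trees: &[Node], x: &[f64], max_samples: usize) -> f64 {
    let avg: f64 = trees.iter().map(|t| path_length(t, x, 0.0)).sum::<f64>() / trees.len() as f64;
    -(2f64.powf(-avg / c(max_samples)))
}

// Classification: scores below the contamination-derived threshold are anomalies.
fn predict_sample(trees: &[Node], x: &[f64], max_samples: usize, threshold: f64) -> i32 {
    if score_sample(trees, x, max_samples) < threshold { -1 } else { 1 }
}
```

Note the `c(size)` credit at early-terminated leaves: a point that stops at `max_depth` is charged the expected remaining depth of the subtree that was never built, which keeps the depth cap from biasing the scores.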
Numerical Considerations:
- Random subsampling for efficiency
- Path length normalization via c(n) function
- Threshold computed from training data quantile
- Default max_samples: min(256, n_samples)
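The threshold step can be sketched in a few standalone lines (assuming scores where lower means more anomalous):

```rust
// Classification threshold from training scores: the contamination-quantile,
// so that roughly `contamination * n` training points score below it and are
// labeled anomalous by predict().
fn threshold_from_contamination(mut scores: Vec<f64>, contamination: f64) -> f64 {
    scores.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let k = (scores.len() as f64 * contamination).floor() as usize;
    scores[k.min(scores.len() - 1)]
}
```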
REFACTOR Phase
- Removed unused imports
- Zero clippy warnings
- Exported in prelude for easy access
- Comprehensive documentation with examples
- Added fraud detection example scenario
Final State:
- Tests: 613 passing (596 → 613, +17)
- Zero warnings
- All quality gates passing
Algorithm Details
Isolation Forest:
- Ensemble method for anomaly detection
- Intuition: Anomalies are easier to isolate than normal points
- Shorter path length → More anomalous
Time Complexity: O(n log m)
- n = samples, m = max_samples
Space Complexity: O(t * m * d)
- t = n_estimators, m = max_samples, d = features
Average Path Length (c function):
c(n) = 2H(n-1) - 2(n-1)/n
where H(i) is the i-th harmonic number, approximately ln(i) + 0.5772 (Euler's constant)
This normalizes path lengths by the expected depth of an unsuccessful search in a binary search tree of n points.
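For example, with the default subsample size of 256: H(255) ≈ ln(255) + 0.5772 ≈ 6.12, so c(256) ≈ 2(6.12) - 2(255)/256 ≈ 10.2. A sample whose average path length across the ensemble equals c(256) scores exactly 2^(-1) = 0.5; noticeably shorter paths push the score toward 1 (anomalous) before the final inversion.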
Parameters
- n_estimators (default: 100): Number of trees in ensemble
- More trees = more stable predictions
- Diminishing returns after ~100 trees
- max_samples (default: min(256, n)): Subsample size per tree
- Smaller = faster training
- 256 is an empirically good default
- Full sample rarely needed
- contamination (default: 0.1): Expected anomaly proportion
- Range: 0.0 to 0.5
- Sets classification threshold
- 0.1 = 10% anomalies expected
- random_state (optional): Seed for reproducibility
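Putting the parameters together, configuring the estimator might look like the snippet below; the `with_*` method names follow the builder convention mentioned in the GREEN phase and should be read as assumptions.

```rust
use aprender::prelude::*;

// All four parameters set explicitly; omit any of them to keep the defaults
// listed above. The with_* method names are assumptions from the issue text.
let forest = IsolationForest::new()
    .with_n_estimators(100)   // ensemble size
    .with_max_samples(256)    // subsample per tree
    .with_contamination(0.1)  // expected anomaly fraction
    .with_random_state(42);   // reproducibility
```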
Example Highlights
The example demonstrates:
- Basic anomaly detection (8 normal + 2 outliers)
- Anomaly score interpretation
- Contamination parameter effects (10%, 20%, 30%)
- Ensemble size comparison (10 vs 100 trees)
- Credit card fraud detection scenario
- Reproducibility with random seeds
- Isolation path length concept
- Max samples parameter
Key Takeaways
- Unsupervised Anomaly Detection: No labeled data required
- Fast Training: O(n log m) makes it scalable
- Interpretable Scores: Path length has clear meaning
- Few Parameters: Easy to use with sensible defaults
- No Distance Metric: Works with any feature types
- Handles High Dimensions: Better than density-based methods
- Ensemble Benefits: Averaging reduces variance
Comparison with Other Methods
vs K-Means:
- K-Means: Finds clusters, requires distance threshold for anomalies
- Isolation Forest: Detects anomalies directly; no manual distance threshold to tune
vs DBSCAN:
- DBSCAN: Density-based, requires eps/min_samples tuning
- Isolation Forest: Contamination parameter is intuitive
vs GMM:
- GMM: Probabilistic, assumes Gaussian distributions
- Isolation Forest: No distributional assumptions
vs One-Class SVM:
- SVM: O(n²) to O(n³) training time
- Isolation Forest: O(n log m) - much faster
Use Cases
- Fraud Detection: Credit card transactions, insurance claims
- Network Security: Intrusion detection, anomalous traffic
- Quality Control: Manufacturing defects, sensor anomalies
- System Monitoring: Server metrics, application logs
- Healthcare: Rare disease detection, unusual patient profiles
Testing Strategy
Property-Based Tests (future work; one property is sketched after this list):
- Score ranges: All scores should be finite
- Contamination consistency: Higher contamination → more anomalies
- Reproducibility: Same seed → same results
- Path length bounds: 0 ≤ path ≤ log2(max_samples)
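As a sketch of how one of these properties could be expressed with the proptest crate (the IsolationForest calls reuse the assumed API from the earlier examples):

```rust
use aprender::prelude::*;
use proptest::prelude::*;

proptest! {
    // Reproducibility: the same seed on the same data must give identical scores.
    #[test]
    fn same_seed_same_scores(seed in 0u64..1_000, n in 5usize..50) {
        // Deterministic synthetic data so the property only exercises the seed.
        let data: Vec<Vec<f64>> = (0..n)
            .map(|i| vec![i as f64, (i * i % 7) as f64])
            .collect();

        // Hypothetical API: new(), with_random_state(), fit(), score_samples().
        let scores = |s: u64| {
            let mut forest = IsolationForest::new().with_random_state(s);
            forest.fit(&data);
            forest.score_samples(&data)
        };

        prop_assert_eq!(scores(seed), scores(seed));
    }
}
```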
Unit Tests (17 implemented):
- Correctness: Detects clear outliers
- API contracts: Panic before fit, return expected types
- Parameters: All builder methods work
- Edge cases: All normal, all anomalous, small datasets