Ensemble Methods Theory
Chapter Status: ✅ 100% Working (All examples verified)
| Status | Count | Examples |
|---|---|---|
| ✅ Working | 34+ | Random Forest classification + regression + OOB estimation verified |
| ⏳ In Progress | 0 | - |
| ⬜ Not Implemented | 0 | - |
Last tested: 2025-11-21 Aprender version: 0.4.1 Test file: src/tree/mod.rs tests (726 tests, 11 OOB tests)
Overview
Ensemble methods combine multiple models to achieve better performance than any single model. The key insight: many weak learners together make a strong learner.
Key Techniques:
- Bagging: Bootstrap aggregating (Random Forests)
- Boosting: Sequential learning from mistakes (future work)
- Voting: Combine predictions via majority vote
Why This Matters: Single decision trees overfit. Random Forests solve this by averaging many trees trained on different data subsets. Result: lower variance, better generalization.
Mathematical Foundation
The Ensemble Principle
Problem: Single model has high variance Solution: Average predictions from multiple models
Ensemble_prediction = Aggregate(model₁, model₂, ..., modelₙ)
For classification: Majority vote
For regression: Mean prediction
Key Insight: If models make uncorrelated errors, averaging reduces overall error.
Variance Reduction Through Averaging
Mathematical property:
Var(Average of N models) = Var(single model) / N
(assuming independent, identically distributed models)
In practice: Models aren't fully independent, but ensemble still reduces variance significantly.
Bagging (Bootstrap Aggregating)
Algorithm:
1. For i = 1 to N:
- Create bootstrap sample Dᵢ (sample with replacement from D)
- Train model Mᵢ on Dᵢ
2. Prediction = Majority_vote(M₁, M₂, ..., Mₙ)
Bootstrap Sample:
- Size: Same as original dataset (n samples)
- Sampling: With replacement (some samples repeated, some excluded)
- Out-of-Bag (OOB): ~37% of samples not in each bootstrap sample
Why it works: Each model sees slightly different data → diverse models → uncorrelated errors
Random Forests: Bagging + Feature Randomness
The Random Forest Algorithm
Random Forests extend bagging with feature randomness:
function RandomForest(X, y, n_trees, max_features):
forest = []
for i = 1 to n_trees:
# Bootstrap sampling
D_i = bootstrap_sample(X, y)
# Train tree with feature randomness
tree = DecisionTree(max_features=sqrt(n_features))
tree.fit(D_i)
forest.append(tree)
return forest
function Predict(forest, x):
votes = [tree.predict(x) for tree in forest]
return majority_vote(votes)
Two Sources of Randomness:
- Bootstrap sampling: Each tree sees different data subset
- Feature randomness: At each split, only consider random subset of features (typically √m features)
Why feature randomness? Prevents correlation between trees. Without it, all trees would use the same strong features at the top.
Out-of-Bag (OOB) Error Estimation
Key Insight: Each tree trained on ~63% of data, leaving ~37% out-of-bag
The Mathematics:
Bootstrap sampling with replacement:
- Probability sample is NOT selected: (1 - 1/n)ⁿ
- As n → ∞: lim (1 - 1/n)ⁿ = 1/e ≈ 0.368
- Therefore: ~36.8% samples are OOB per tree
OOB Score Algorithm:
For each sample xᵢ in training set:
1. Find all trees where xᵢ was NOT in bootstrap sample
2. Predict using only those trees
3. Aggregate predictions (majority vote or averaging)
Classification: OOB_accuracy = accuracy(oob_predictions, y_true)
Regression: OOB_R² = r_squared(oob_predictions, y_true)
Why OOB is Powerful:
- ✅ Free validation: No separate validation set needed
- ✅ Unbiased estimate: Similar to cross-validation accuracy
- ✅ Use all data: 100% for training, still get validation score
- ✅ Model selection: Compare different n_estimators values
- ✅ Early stopping: Monitor OOB score during training
When to Use OOB:
- Small datasets (can't afford to hold out validation set)
- Hyperparameter tuning (test different forest sizes)
- Production monitoring (track OOB score over time)
Practical Usage in Aprender:
use aprender::tree::RandomForestClassifier;
use aprender::primitives::Matrix;
let mut rf = RandomForestClassifier::new(50)
.with_max_depth(10)
.with_random_state(42);
rf.fit(&x_train, &y_train).unwrap();
// Get OOB score (unbiased estimate of generalization error)
let oob_accuracy = rf.oob_score().unwrap();
let training_accuracy = rf.score(&x_train, &y_train);
println!("Training accuracy: {:.3}", training_accuracy); // Often high
println!("OOB accuracy: {:.3}", oob_accuracy); // More realistic
// OOB accuracy typically close to test set accuracy!
Test Reference: src/tree/mod.rs::tests::test_random_forest_classifier_oob_score_after_fit
Implementation in Aprender
Example 1: Basic Random Forest
use aprender::tree::RandomForestClassifier;
use aprender::primitives::Matrix;
// XOR problem (not linearly separable)
let x = Matrix::from_vec(4, 2, vec![
0.0, 0.0, // Class 0
0.0, 1.0, // Class 1
1.0, 0.0, // Class 1
1.0, 1.0, // Class 0
]).unwrap();
let y = vec![0, 1, 1, 0];
// Random Forest with 10 trees
let mut forest = RandomForestClassifier::new(10)
.with_max_depth(5)
.with_random_state(42); // Reproducible
forest.fit(&x, &y).unwrap();
// Predict
let predictions = forest.predict(&x);
println!("Predictions: {:?}", predictions); // [0, 1, 1, 0]
let accuracy = forest.score(&x, &y);
println!("Accuracy: {:.3}", accuracy); // 1.000
Test Reference: src/tree/mod.rs::tests::test_random_forest_fit_basic
Example 2: Multi-Class Classification (Iris)
// Iris dataset (3 classes, 4 features)
// Simplified - see case study for full implementation
let mut forest = RandomForestClassifier::new(100) // 100 trees
.with_max_depth(10)
.with_random_state(42);
forest.fit(&x_train, &y_train).unwrap();
// Test set evaluation
let y_pred = forest.predict(&x_test);
let accuracy = forest.score(&x_test, &y_test);
println!("Test Accuracy: {:.3}", accuracy); // e.g., 0.973
// Random Forest typically outperforms single tree!
Case Study: See Random Forest - Iris Classification
Example 3: Reproducibility
// Same random_state → same results
let mut forest1 = RandomForestClassifier::new(50)
.with_random_state(42);
forest1.fit(&x, &y).unwrap();
let mut forest2 = RandomForestClassifier::new(50)
.with_random_state(42);
forest2.fit(&x, &y).unwrap();
// Predictions identical
assert_eq!(forest1.predict(&x), forest2.predict(&x));
Test Reference: src/tree/mod.rs::tests::test_random_forest_reproducible
Random Forest Regression
Random Forests also work for regression tasks (predicting continuous values) using the same bagging principle with a key difference: instead of majority voting, predictions are averaged across all trees.
Algorithm for Regression
use aprender::tree::RandomForestRegressor;
use aprender::primitives::{Matrix, Vector};
// Housing data: [sqft, bedrooms, age] → price
let x = Matrix::from_vec(8, 3, vec![
1500.0, 3.0, 10.0, // $280k
2000.0, 4.0, 5.0, // $350k
1200.0, 2.0, 30.0, // $180k
// ... more samples
]).unwrap();
let y = Vector::from_slice(&[280.0, 350.0, 180.0, /* ... */]);
// Train Random Forest Regressor
let mut rf = RandomForestRegressor::new(50)
.with_max_depth(8)
.with_random_state(42);
rf.fit(&x, &y).unwrap();
// Predict: Average predictions from all 50 trees
let predictions = rf.predict(&x);
let r2 = rf.score(&x, &y); // R² coefficient
Test Reference: src/tree/mod.rs::tests::test_random_forest_regressor_*
Prediction Aggregation for Regression
Classification:
Prediction = mode([tree₁(x), tree₂(x), ..., treeₙ(x)]) # Majority vote
Regression:
Prediction = mean([tree₁(x), tree₂(x), ..., treeₙ(x)]) # Average
Why averaging works:
- Each tree makes different errors due to bootstrap sampling
- Errors cancel out when averaged
- Result: smoother, more stable predictions
Variance Reduction in Regression
Single Decision Tree:
- High variance (sensitive to data changes)
- Can overfit training data
- Predictions can be "jumpy" (discontinuous)
Random Forest Ensemble:
- Lower variance: Var(RF) ≈ Var(Tree) / √n_trees
- Averaging smooths out individual tree predictions
- More robust to outliers and noise
Example:
Sample: [2000 sqft, 3 bed, 10 years]
Tree 1 predicts: $305k
Tree 2 predicts: $295k
Tree 3 predicts: $310k
...
Tree 50 predicts: $302k
Random Forest prediction: mean = $303k (stable!)
Single tree might predict: $310k or $295k (unstable)
Comparison: Regression vs Classification
| Aspect | Random Forest Regression | Random Forest Classification |
|---|---|---|
| Task | Predict continuous values | Predict discrete classes |
| Base learner | DecisionTreeRegressor | DecisionTreeClassifier |
| Split criterion | MSE (variance reduction) | Gini impurity |
| Leaf prediction | Mean of samples | Majority class |
| Aggregation | Average predictions | Majority vote |
| Evaluation | R² score, MSE, MAE | Accuracy, F1 score |
| Output | Real number (e.g., $305k) | Class label (e.g., 0, 1, 2) |
When to Use Random Forest Regression
✅ Good for:
- Non-linear relationships (e.g., housing prices)
- Feature interactions (e.g., size × location)
- Outlier robustness
- When single tree overfits
- Want stable predictions (low variance)
❌ Not ideal for:
- Linear relationships (use LinearRegression)
- Need smooth predictions (trees predict step functions)
- Extrapolation beyond training range
- Very small datasets (< 50 samples)
Example: Housing Price Prediction
// Non-linear housing data
let x = Matrix::from_vec(20, 4, vec![
1000.0, 2.0, 1.0, 50.0, // $140k (small, old)
2500.0, 5.0, 3.0, 3.0, // $480k (large, new)
// ... quadratic relationship between size and price
]).unwrap();
let y = Vector::from_slice(&[140.0, 480.0, /* ... */]);
// Train Random Forest
let mut rf = RandomForestRegressor::new(30).with_max_depth(6);
rf.fit(&x, &y).unwrap();
// Compare with single tree
let mut single_tree = DecisionTreeRegressor::new().with_max_depth(6);
single_tree.fit(&x, &y).unwrap();
let rf_r2 = rf.score(&x, &y); // e.g., 0.95
let tree_r2 = single_tree.score(&x, &y); // e.g., 1.00 (overfit!)
// On test data:
// RF generalizes better due to averaging
Case Study: See Random Forest Regression
Hyperparameter Recommendations for Regression
Default configuration:
n_estimators = 50-100(more trees = more stable)max_depth = 8-12(can be deeper than classification trees)- No min_samples_split needed (averaging handles overfitting)
Tuning strategy:
- Start with 50 trees, max_depth=8
- Check train vs test R²
- If overfitting: decrease max_depth or increase min_samples_split
- If underfitting: increase max_depth or n_estimators
- Use cross-validation for final tuning
Hyperparameter Tuning
Number of Trees (n_estimators)
Trade-off:
- Too few (n < 10): High variance, unstable
- Enough (n = 100): Good performance, stable
- Many (n = 500+): Diminishing returns, slower training
Rule of Thumb:
- Start with 100 trees
- More trees never hurt accuracy (just slower)
- Increasing trees reduces overfitting
Finding optimal n:
// Pseudocode
for n in [10, 50, 100, 200, 500] {
forest = RandomForestClassifier::new(n);
cv_score = cross_validate(forest, x, y, k=5);
// Select n with best cv_score (or when improvement plateaus)
}
Max Depth (max_depth)
Trade-off:
- Shallow trees (max_depth = 3): Underfitting
- Deep trees (max_depth = 20+): OK for Random Forests! (bagging reduces overfitting)
- Unlimited depth: Common in Random Forests (unlike single trees)
Random Forest advantage: Can use deeper trees than single decision tree without overfitting.
Feature Randomness (max_features)
Typical values:
- Classification: max_features = √m (where m = total features)
- Regression: max_features = m/3
Trade-off:
- Low (e.g., 1): Very diverse trees, may miss important features
- High (e.g., m): Correlated trees, loses ensemble benefit
- Sqrt(m): Good balance (recommended default)
Random Forest vs Single Decision Tree
Comparison Table
| Property | Single Tree | Random Forest |
|---|---|---|
| Overfitting | High | Low (averaging reduces variance) |
| Stability | Low (small data changes → different tree) | High (ensemble is stable) |
| Interpretability | High (can visualize) | Medium (100 trees hard to interpret) |
| Training Speed | Fast | Slower (train N trees) |
| Prediction Speed | Very fast | Slower (N predictions + voting) |
| Accuracy | Good | Better (typically +5-15% improvement) |
Empirical Example
Scenario: Iris classification (150 samples, 4 features, 3 classes)
| Model | Test Accuracy |
|---|---|
| Single Decision Tree (max_depth=5) | 93.3% |
| Random Forest (100 trees, max_depth=10) | 97.3% |
Improvement: +4% absolute, ~60% reduction in error rate!
Advantages and Limitations
Advantages ✅
- Reduced overfitting: Averaging reduces variance
- Robust: Handles noise, outliers well
- Feature importance: Can rank feature importance across forest
- No feature scaling: Inherits from decision trees
- Handles missing values: Can impute or split on missingness
- Parallel training: Trees are independent (can train in parallel)
- OOB score: Free validation estimate
Limitations ❌
- Less interpretable: 100 trees vs 1 tree
- Memory: Stores N trees (larger model size)
- Slower prediction: Must query N trees
- Black box: Hard to explain individual predictions (vs single tree)
- Extrapolation: Can't predict outside training data range
Understanding Bootstrap Sampling
Bootstrap Sample Properties
Original dataset: 100 samples [S₁, S₂, ..., S₁₀₀]
Bootstrap sample (with replacement):
- Some samples appear 0 times (out-of-bag)
- Some samples appear 1 time
- Some samples appear 2+ times
Probability analysis:
P(sample not chosen in one draw) = (n-1)/n
P(sample not in bootstrap, after n draws) = ((n-1)/n)ⁿ
As n → ∞: ((n-1)/n)ⁿ → 1/e ≈ 0.37
Result: ~37% of samples are out-of-bag
Test Reference: src/tree/mod.rs::tests::test_bootstrap_sample_*
Diversity Through Sampling
Example: Dataset with 6 samples [A, B, C, D, E, F]
Bootstrap Sample 1: [A, A, C, D, F, F] (B and E missing) Bootstrap Sample 2: [B, C, C, D, E, E] (A and F missing) Bootstrap Sample 3: [A, B, D, D, E, F] (C missing)
Result: Each tree sees different data → different structure → diverse predictions
Feature Importance
Random Forests naturally compute feature importance:
Method: For each feature, measure total reduction in Gini impurity across all trees
Importance(feature_i) = Σ (over all nodes using feature_i) InfoGain
Normalize: Importance / Σ(all importances)
Interpretation:
- High importance: Feature frequently used for splits, high information gain
- Low importance: Feature rarely used or low information gain
- Zero importance: Feature never used
Use cases:
- Feature selection (drop low-importance features)
- Model interpretation (which features matter most?)
- Domain validation (do important features make sense?)
Real-World Application
Medical Diagnosis: Cancer Detection
Problem: Classify tumor as benign/malignant from 30 measurements
Why Random Forest?:
- Handles high-dimensional data (30 features)
- Robust to measurement noise
- Provides feature importance (which biomarkers matter?)
- Good accuracy (ensemble outperforms single tree)
Result: Random Forest achieves 97% accuracy vs 93% for single tree
Credit Risk Assessment
Problem: Predict loan default from income, debt, employment, credit history
Why Random Forest?:
- Captures non-linear relationships (income × debt interaction)
- Robust to outliers (unusual income values)
- Handles mixed features (numeric + categorical)
Result: Random Forest reduces false negatives by 40% vs logistic regression
Verification Through Tests
Random Forest tests verify ensemble properties:
Bootstrap Tests:
- Bootstrap sample has correct size (n samples)
- Reproducibility (same seed → same sample)
- Coverage (~63% of data in each sample)
Forest Tests:
- Correct number of trees trained
- All trees make predictions
- Majority voting works correctly
- Reproducible with random_state
Test Reference: src/tree/mod.rs (7+ ensemble tests)
Further Reading
Peer-Reviewed Papers
Breiman (2001) - Random Forests
- Relevance: Original Random Forest paper
- Link: SpringerLink (publicly accessible)
- Key Contributions:
- Bagging + feature randomness
- OOB error estimation
- Feature importance computation
- Applied in:
src/tree/mod.rsRandomForestClassifier
Dietterich (2000) - Ensemble Methods in Machine Learning
- Relevance: Survey of ensemble techniques (bagging, boosting, voting)
- Link: SpringerLink
- Key Insight: Why and when ensembles work
Related Chapters
- Decision Trees Theory - Foundation for Random Forests
- Cross-Validation Theory - Tuning hyperparameters
- Classification Metrics Theory - Evaluating ensembles
Summary
What You Learned:
- ✅ Ensemble methods: combine many models → better than any single model
- ✅ Bagging: train on bootstrap samples, average predictions
- ✅ Random Forests: bagging + feature randomness
- ✅ Variance reduction: Var(ensemble) ≈ Var(single) / N
- ✅ OOB score: free validation estimate (~37% out-of-bag)
- ✅ Hyperparameters: n_trees (100+), max_depth (deeper OK), max_features (√m)
- ✅ Advantages: less overfitting, robust, accurate
- ✅ Trade-off: less interpretable, slower than single tree
Verification Guarantee: Random Forest implementation extensively tested (7+ tests) in src/tree/mod.rs. Tests verify bootstrap sampling, tree training, voting, and reproducibility.
Quick Reference:
- Default config: 100 trees, max_depth=10-20, max_features=√m
- Tuning: More trees → better (just slower)
- OOB score: Estimate test accuracy without test set
- Feature importance: Which features matter most?
Key Equations:
Bootstrap: Sample n times with replacement
Prediction: Majority_vote(tree₁, tree₂, ..., treeₙ)
Variance reduction: σ²_ensemble ≈ σ²_tree / N (if independent)
OOB samples: ~37% per tree
Next Chapter: K-Means Clustering Theory
Previous Chapter: Decision Trees Theory