Cross-Validation Theory
Chapter Status: ✅ 100% Working (All examples verified)
| Status | Count | Examples |
|---|---|---|
| ✅ Working | 12+ | Case study has comprehensive tests |
| ⏳ In Progress | 0 | - |
| ⬜ Not Implemented | 0 | - |
Last tested: 2025-11-19 | Aprender version: 0.3.0 | Test files: tests/integration.rs and src/model_selection/mod.rs tests
Overview
Cross-validation estimates how well a model generalizes to unseen data by systematically testing on held-out portions of the training set. It's the gold standard for model evaluation.
Key Concepts:
- K-Fold CV: Split data into K parts, train on K-1, test on 1
- Train/Test Split: Simple holdout validation
- Reproducibility: Random seeds ensure consistent splits
Why This Matters: Using training accuracy to evaluate a model is like grading your own exam. Cross-validation provides an honest estimate of real-world performance.
Mathematical Foundation
The K-Fold Algorithm
1. Partition the data into K equal-sized folds: D₁, D₂, ..., Dₖ
2. For each fold i:
   - Train on D \ Dᵢ (all data except fold i)
   - Test on Dᵢ
   - Record the score sᵢ
3. Average the scores: CV_score = (1/K) Σᵢ sᵢ
Key Property: Every data point is used for testing exactly once and training exactly K-1 times.
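To make the partitioning concrete, here is a minimal sketch in plain Rust (a hypothetical `kfold_indices` helper, not the aprender API) that assigns indices to K folds round-robin and prints each train/test split, demonstrating the key property above:

```rust
// Hypothetical helper, for illustration only (not aprender's KFold):
// assign indices 0..n to k folds round-robin, so fold sizes differ by at most 1.
fn kfold_indices(n: usize, k: usize) -> Vec<Vec<usize>> {
    let mut folds: Vec<Vec<usize>> = vec![Vec::new(); k];
    for i in 0..n {
        folds[i % k].push(i);
    }
    folds
}

fn main() {
    let n = 10;
    for (i, test_fold) in kfold_indices(n, 5).iter().enumerate() {
        // The training set is everything outside the current test fold.
        let train: Vec<usize> = (0..n).filter(|j| !test_fold.contains(j)).collect();
        println!("fold {i}: test {test_fold:?}, train {train:?}");
    }
}
```

Each of the 10 indices appears in exactly one test fold and in the training set of the other four splits.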
Common K Values:
- K=5: Standard choice (80% train, 20% test per fold)
- K=10: More thorough but slower
- K=n: Leave-One-Out CV (LOOCV) - expensive; nearly unbiased, but the estimate can have high variance
Implementation in Aprender
Example 1: Train/Test Split
use aprender::model_selection::train_test_split;
use aprender::primitives::{Matrix, Vector};
let x = Matrix::from_vec(10, 2, vec![/*...*/]).unwrap();
let y = Vector::from_vec(vec![/*...*/]);
// 80% train, 20% test, reproducible with seed 42
let (x_train, x_test, y_train, y_test) =
train_test_split(&x, &y, 0.2, Some(42)).unwrap();
assert_eq!(x_train.shape().0, 8); // 80% of 10
assert_eq!(x_test.shape().0, 2); // 20% of 10
Test Reference: src/model_selection/mod.rs::tests::test_train_test_split_basic
Example 2: K-Fold Cross-Validation
use aprender::model_selection::{KFold, cross_validate};
use aprender::linear_model::LinearRegression;
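// Reuses the x and y defined in Example 1 above.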
let kfold = KFold::new(5) // 5 folds
.with_shuffle(true) // Shuffle data
.with_random_state(42); // Reproducible
let model = LinearRegression::new();
let result = cross_validate(&model, &x, &y, &kfold).unwrap();
println!("Mean score: {:.3}", result.mean()); // e.g., 0.874
println!("Std dev: {:.3}", result.std()); // e.g., 0.042
Test Reference: src/model_selection/mod.rs::tests::test_cross_validate
Verification: Property Tests
Cross-validation has strong mathematical properties we can verify:
- Property 1: Every sample appears in the test set exactly once
- Property 2: Folds are disjoint (no overlap)
- Property 3: The union of all folds is the complete dataset
These are verified in the comprehensive test suite. See Case Study for full property tests.
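As a flavor of what such checks look like, here is a stand-alone sketch (plain Rust asserts over the hypothetical `kfold_indices` helper from the algorithm sketch above, not the actual aprender test suite) that exercises all three properties:

```rust
use std::collections::HashSet;

// Same hypothetical round-robin helper as in the algorithm sketch above.
fn kfold_indices(n: usize, k: usize) -> Vec<Vec<usize>> {
    let mut folds: Vec<Vec<usize>> = vec![Vec::new(); k];
    for i in 0..n {
        folds[i % k].push(i);
    }
    folds
}

fn main() {
    let (n, k) = (103, 5); // deliberately not divisible by k
    let folds = kfold_indices(n, k);
    let mut seen = HashSet::new();
    for fold in &folds {
        for &i in fold {
            // Properties 1 and 2: no index may appear in two test folds.
            assert!(seen.insert(i), "index {i} appears in more than one fold");
        }
    }
    // Property 3: the union of the folds covers the whole dataset.
    assert_eq!(seen.len(), n);
    println!("all three properties hold for n = {n}, k = {k}");
}
```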
Practical Considerations
When to Use
- ✅ Use K-Fold:
  - Small/medium datasets (< 10,000 samples)
  - Need a robust performance estimate
  - Hyperparameter tuning
- ✅ Use Train/Test Split:
  - Large datasets (> 100,000 samples), where K-Fold is too slow
  - Quick evaluation needed
  - Final model assessment (after CV for hyperparameters)
Common Pitfalls
- Data Leakage: Fitting preprocessing (scaling, imputation) on the full dataset before splitting
  - Solution: Fit on the training fold only, then apply to the test fold (see the first sketch after this list)
- Temporal Data: Shuffling time series data breaks temporal order
  - Solution: Use a time-series split (future work in aprender; see the second sketch after this list)
- Class Imbalance: Random splits may create imbalanced folds
  - Solution: Use stratified K-Fold (future work)
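Below is a minimal leakage-avoidance sketch in plain Rust. The `mean_std` and `standardize` helpers are hypothetical stand-ins for a preprocessing step, not aprender APIs; the point is only that the statistics come from the training fold and are reused unchanged on the test fold:

```rust
// Compute mean and (population) standard deviation of a slice.
fn mean_std(data: &[f64]) -> (f64, f64) {
    let n = data.len() as f64;
    let mean = data.iter().sum::<f64>() / n;
    let var = data.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    (mean, var.sqrt())
}

// Apply a standardization fitted elsewhere.
fn standardize(data: &[f64], mean: f64, std: f64) -> Vec<f64> {
    data.iter().map(|x| (x - mean) / std).collect()
}

fn main() {
    let train = [1.0, 2.0, 3.0, 4.0];
    let test = [5.0, 6.0];
    // Correct: fit the statistics on the training fold only...
    let (m, s) = mean_std(&train);
    let train_scaled = standardize(&train, m, s);
    // ...then apply those same statistics to the test fold.
    let test_scaled = standardize(&test, m, s);
    println!("train: {train_scaled:?}, test: {test_scaled:?}");
}
```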
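And here is a sketch of the expanding-window idea behind a time-series split (again hypothetical, since aprender lists this as future work): each test window strictly follows all of its training data, so temporal order is never violated.

```rust
// Hypothetical expanding-window splitter for temporal data (illustration only):
// split i trains on everything before the i-th window and tests on that window.
fn time_series_splits(n: usize, k: usize) -> Vec<(Vec<usize>, Vec<usize>)> {
    let window = n / (k + 1);
    (1..=k)
        .map(|i| {
            let train: Vec<usize> = (0..i * window).collect();
            let test: Vec<usize> = (i * window..(i + 1) * window).collect();
            (train, test)
        })
        .collect()
}

fn main() {
    for (i, (train, test)) in time_series_splits(12, 3).iter().enumerate() {
        println!("split {i}: train {train:?} -> test {test:?}");
    }
}
```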
Real-World Application
Case Study Reference: See Case Study: Cross-Validation for complete implementation showing:
- Full RED-GREEN-REFACTOR workflow
- 12+ tests covering all edge cases
- Property tests proving correctness
- Integration with LinearRegression
- Reproducibility verification
Key Takeaway: The case study shows EXTREME TDD in action - every requirement becomes a test first.
Further Reading
Peer-Reviewed Paper
Kohavi (1995) - A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection
- Relevance: Foundational empirical study of the bias and variance of cross-validation and bootstrap estimators
- Link: CiteSeerX (publicly accessible)
- Key Finding: K=10 optimal for bias-variance tradeoff
- Applied in:
src/model_selection/mod.rs
Related Chapters
- Linear Regression Theory - Model to evaluate with CV
- Regression Metrics Theory - Scores used in CV
- Case Study: Cross-Validation - REQUIRED READING
Summary
What You Learned:
- ✅ K-Fold algorithm: train on K-1 folds, test on 1
- ✅ Train/test split for quick evaluation
- ✅ Reproducibility with random seeds
- ✅ When to use CV vs simple split
Verification Guarantee: All cross-validation code is extensively tested (12+ tests) as shown in the Case Study. Property tests verify mathematical correctness.
Next Chapter: Gradient Descent Theory
Previous Chapter: Classification Metrics Theory
REQUIRED: Read Case Study: Cross-Validation for complete EXTREME TDD implementation