Online Learning Theory
Online learning is a machine learning paradigm where models update incrementally as new data arrives, rather than requiring full retraining on the entire dataset. This is essential for streaming applications, real-time systems, and scenarios where data distribution changes over time.
Core Concepts
Batch vs Online Learning
Batch Learning:
- Train on entire dataset at once
- O(n) memory for n samples
- Requires full retraining for updates
- Suitable for static datasets
Online Learning:
- Update model one sample at a time
- O(1) memory per update
- Incremental updates without retraining
- Suitable for streaming data
The Regret Framework
Online learning is analyzed using regret: the difference between the learner's cumulative loss and the cumulative loss of the best fixed hypothesis in hindsight.
Regret_T = Σ_{t=1}^T l(ŷ_t, y_t) - min_h Σ_{t=1}^T l(h(x_t), y_t)
A good online algorithm achieves sublinear regret: O(√T) for convex losses.
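As a minimal, library-free sketch (names are illustrative, not from aprender), regret can be computed once the per-round losses of the learner and of each candidate fixed hypothesis have been recorded:
// Illustrative sketch: cumulative regret against the best fixed hypothesis in hindsight.
// learner_losses[t] = l(ŷ_t, y_t); hindsight_losses[h][t] = l(h(x_t), y_t).
fn regret(learner_losses: &[f64], hindsight_losses: &[Vec<f64>]) -> f64 {
    let learner_total: f64 = learner_losses.iter().sum();
    let best_fixed: f64 = hindsight_losses
        .iter()
        .map(|losses| losses.iter().sum::<f64>())
        .fold(f64::INFINITY, f64::min);
    learner_total - best_fixed
}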
Online Gradient Descent
The fundamental online learning algorithm:
w_{t+1} = w_t - η_t ∇l(w_t; x_t, y_t)
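As a minimal sketch of this update for a linear model with squared loss (illustrative only, not the aprender API):
// One online gradient descent step: w ← w - η ∇l(w; x, y) for squared loss.
fn ogd_step(w: &mut [f64], x: &[f64], y: f64, eta: f64) {
    // Prediction ŷ = w · x
    let y_hat: f64 = w.iter().zip(x).map(|(wi, xi)| wi * xi).sum();
    // Gradient of 0.5 * (ŷ - y)² with respect to w is (ŷ - y) * x
    let err = y_hat - y;
    for (wi, xi) in w.iter_mut().zip(x) {
        *wi -= eta * err * xi;
    }
}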
Learning Rate Schedules
| Schedule | Formula | Use Case |
|---|---|---|
| Constant | η_t = η_0 | Stationary distributions |
| Inverse | η_t = η_0 / t | Strongly convex losses |
| Inverse Sqrt | η_t = η_0 / √t | Convex, bounded gradients |
| AdaGrad | η_{t,i} = η_0 / √(Σ g²_{s,i}) | Sparse features |
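A sketch of how these schedules map to a step size at round t (the enum and function names are hypothetical, not aprender's):
// t starts at 1. For AdaGrad, accumulated_sq[i] is the running sum of squared
// gradients for coordinate i.
enum Schedule {
    Constant,
    Inverse,
    InverseSqrt,
    AdaGrad { accumulated_sq: Vec<f64> },
}

fn learning_rate(schedule: &Schedule, eta0: f64, t: usize, coord: usize) -> f64 {
    match schedule {
        Schedule::Constant => eta0,
        Schedule::Inverse => eta0 / t as f64,
        Schedule::InverseSqrt => eta0 / (t as f64).sqrt(),
        // Small epsilon avoids division by zero before any gradient is seen.
        Schedule::AdaGrad { accumulated_sq } => eta0 / (accumulated_sq[coord].sqrt() + 1e-8),
    }
}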
Implementation in Aprender
use aprender::online::{
OnlineLearner, OnlineLearnerConfig, OnlineLinearRegression,
LearningRateDecay,
};
// Configure online learner
let config = OnlineLearnerConfig {
learning_rate: 0.01,
decay: LearningRateDecay::InverseSqrt,
l2_reg: 0.001,
..Default::default()
};
let mut model = OnlineLinearRegression::with_config(2, config);
// Incremental updates
for (x, y) in streaming_data {
let loss = model.partial_fit(&x, &[y], None)?;
println!("Loss: {:.4}", loss);
}
Concept Drift
Real-world data distributions change over time. Concept drift occurs when the relationship P(Y|X) changes, degrading model performance.
Types of Drift
- Sudden Drift: Abrupt distribution change (e.g., system upgrade)
- Gradual Drift: Slow transition between concepts
- Incremental Drift: Continuous small changes
- Recurring Drift: Cyclic patterns (e.g., seasonality)
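To make the sudden and gradual cases concrete, a small synthetic labeling function (purely illustrative, unrelated to aprender) can switch or blend between two concepts for P(Y|X):
// Returns the target for input x at stream position t; the concept changes at change_point.
fn drifting_target(x: f64, t: usize, change_point: usize, sudden: bool) -> f64 {
    let (old_concept, new_concept) = (2.0 * x, -2.0 * x);
    if sudden {
        // Sudden drift: abrupt switch at change_point
        if t < change_point { old_concept } else { new_concept }
    } else {
        // Gradual drift: linear blend over 1000 steps after change_point
        let mix = ((t as f64 - change_point as f64) / 1000.0).clamp(0.0, 1.0);
        (1.0 - mix) * old_concept + mix * new_concept
    }
}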
Drift Detection Methods
DDM (Drift Detection Method)
Monitors error rate statistics [Gama et al., 2004]:
use aprender::online::drift::{DDM, DriftDetector, DriftStatus};
let mut ddm = DDM::new();
for prediction_error in errors {
ddm.add_element(prediction_error);
match ddm.detected_change() {
DriftStatus::Drift => println!("Drift detected! Retrain model."),
DriftStatus::Warning => println!("Warning: potential drift"),
DriftStatus::Stable => {}
}
}
ADWIN (Adaptive Windowing)
Maintains adaptive window size [Bifet & Gavalda, 2007]:
- Automatically adjusts window to recent relevant data
- Detects both sudden and gradual drift
- Recommended default for most applications
use aprender::online::drift::{ADWIN, DriftDetector};
let mut adwin = ADWIN::with_delta(0.002); // 99.8% confidence
// Add observations
adwin.add_element(true); // error
adwin.add_element(false); // correct
println!("Window size: {}", adwin.window_size());
println!("Mean error: {:.3}", adwin.mean());
Curriculum Learning
Training on samples ordered by difficulty, from easy to hard [Bengio et al., 2009]:
Benefits
- Faster convergence
- Better generalization
- Avoids poor local minima induced by hard examples early in training
- Mimics human learning progression
Implementation
use aprender::online::curriculum::{
LinearCurriculum, CurriculumScheduler,
FeatureNormScorer, DifficultyScorer,
};
// Linear difficulty progression over 5 stages
let mut curriculum = LinearCurriculum::new(5);
// Score samples by feature norm (larger = harder)
let scorer = FeatureNormScorer::new();
for sample in &samples {
let difficulty = scorer.score(&sample.features, 0.0);
// Only train on samples below current threshold
if difficulty <= curriculum.current_threshold() {
model.partial_fit(&sample.features, &sample.target, None)?;
}
}
// Advance to next curriculum stage
curriculum.advance();
Knowledge Distillation
Transfer knowledge from a complex "teacher" model to a simpler "student" model [Hinton et al., 2015].
Temperature Scaling
Softmax with temperature T reveals "dark knowledge":
p_i = exp(z_i/T) / Σ_j exp(z_j/T)
- T=1: Standard softmax (sharper distribution)
- T>1: Softer probability distribution
- T=3: Recommended default for distillation
use aprender::online::distillation::{
softmax_temperature, DEFAULT_TEMPERATURE,
};
let teacher_logits = vec![1.0, 3.0, 0.5];
// Standard softmax (T=1)
let hard = softmax_temperature(&teacher_logits, 1.0);
// [0.111, 0.821, 0.067]
// Soft targets (T=3, default)
let soft = softmax_temperature(&teacher_logits, DEFAULT_TEMPERATURE);
// [0.264, 0.513, 0.223]
Distillation Loss
Combined loss with hard labels and soft targets:
L = α * KL(soft_teacher || soft_student) + (1-α) * CE(student, labels)
use aprender::online::distillation::{DistillationConfig, DistillationLoss};
let config = DistillationConfig {
temperature: 3.0,
alpha: 0.7, // 70% distillation, 30% hard labels
learning_rate: 0.01,
l2_reg: 0.0,
};
let loss = DistillationLoss::with_config(config);
let distill_loss = loss.compute(&student_logits, &teacher_logits, &hard_labels)?;
Corpus Management
Managing training data in memory-constrained streaming scenarios.
Eviction Policies
| Policy | Description | Use Case |
|---|---|---|
| FIFO | Remove oldest samples | Simple, predictable |
| Reservoir | Random sampling, uniform distribution | Statistical sampling |
| Importance | Keep high-loss samples | Hard example mining |
| Diversity | Maximize feature space coverage | Avoid redundancy |
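As a sketch of the Reservoir policy (classic reservoir sampling, not aprender's internal implementation; assumes the rand crate):
use rand::Rng;

// Keeps a uniform random sample of size k over a stream of unknown length.
fn reservoir_insert<T>(buffer: &mut Vec<T>, k: usize, seen: &mut usize, item: T, rng: &mut impl Rng) {
    *seen += 1;
    if buffer.len() < k {
        buffer.push(item);
    } else {
        // Each new item replaces a buffered one with probability k / seen.
        let j = rng.gen_range(0..*seen);
        if j < k {
            buffer[j] = item;
        }
    }
}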
Sample Deduplication
Hash-based deduplication prevents redundant samples:
use aprender::online::corpus::{CorpusBuffer, CorpusBufferConfig, EvictionPolicy};
let config = CorpusBufferConfig {
max_size: 1000,
policy: EvictionPolicy::Reservoir,
deduplicate: true, // Hash-based deduplication
seed: Some(42),
};
let mut buffer = CorpusBuffer::with_config(config);
RetrainOrchestrator
Automated pipeline combining all components:
use aprender::online::{
    OnlineLinearRegression,
    orchestrator::{ObserveResult, OrchestratorBuilder},
};
let model = OnlineLinearRegression::new(n_features);
let mut orchestrator = OrchestratorBuilder::new(model, n_features)
.min_samples(100) // Min samples before retraining
.max_buffer_size(10000) // Corpus capacity
.incremental_updates(true) // Enable partial_fit
.curriculum_learning(true) // Easy-to-hard ordering
.curriculum_stages(5) // 5 difficulty levels
.adwin_delta(0.002) // Drift sensitivity
.build();
// Process streaming predictions
for (features, target, prediction) in stream {
match orchestrator.observe(&features, &target, &prediction)? {
ObserveResult::Stable => {}
ObserveResult::Warning => println!("Potential drift detected"),
ObserveResult::Retrained => println!("Model retrained"),
}
}
Mathematical Foundations
Convergence Guarantees
For convex loss functions with bounded gradients ||∇l|| ≤ G:
Online gradient descent with η_t = η_0 / √t:
Regret_T ≤ O(√T)
AdaGrad [Duchi et al., 2011]:
Regret_T ≤ O(√T), with adaptive per-coordinate rates that are often tighter when gradients are sparse
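For a concrete constant, Hazan (2016) shows that online gradient descent over a convex feasible set of diameter D, with gradient norms bounded by G and step sizes η_t = D/(G√t), satisfies Regret_T ≤ (3/2) · G · D · √T, so the average regret Regret_T / T vanishes as T → ∞.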
ADWIN Theoretical Properties
ADWIN guarantees [Bifet & Gavalda, 2007]:
- False positive rate bounded by δ
- Window contains only data from current distribution
- Memory: O(log(W)/ε²) where W is window size
References
- Gama, J., et al. (2004). "Learning with drift detection." SBIA 2004.
- Bifet, A., & Gavalda, R. (2007). "Learning from time-changing data with adaptive windowing." SDM 2007.
- Bengio, Y., et al. (2009). "Curriculum learning." ICML 2009.
- Hinton, G., et al. (2015). "Distilling the knowledge in a neural network." NIPS 2014 Workshop.
- Duchi, J., et al. (2011). "Adaptive subgradient methods for online learning." JMLR.
- Shalev-Shwartz, S. (2012). "Online learning and online convex optimization." Foundations and Trends in ML.
- Hazan, E. (2016). "Introduction to online convex optimization." Foundations and Trends in Optimization.
- Lu, J., et al. (2018). "Learning under concept drift: A review." IEEE TKDE.
- Wang, H., & Abraham, Z. (2015). "Concept drift detection for streaming data." IJCNN 2015.
- Gomes, H.M., et al. (2017). "A survey on ensemble learning for data stream classification." ACM Computing Surveys.