Lottery Ticket Hypothesis
The Lottery Ticket Hypothesis (LTH) posits that dense neural networks contain sparse subnetworks ("winning tickets") that, when reset to their initial weights and trained in isolation, can match the accuracy of the full network.
Overview
Frankle & Carbin (2018) discovered that randomly-initialized neural networks contain sparse subnetworks that, when trained in isolation from initialization, can match the test accuracy of the full network.
Key Insight
"A randomly-initialized, dense neural network contains a subnetwork that is initialized such that—when trained in isolation—it can match the test accuracy of the original network after training for at most the same number of iterations."
This challenges the common belief that over-parameterization is necessary for training success.
The Algorithm: Iterative Magnitude Pruning (IMP)
The original LTH algorithm uses iterative magnitude pruning with weight rewinding:
Algorithm Steps
1. Initialize the network with random weights W₀
2. Train to convergence, obtaining W_T
3. Prune p% of the remaining weights with the smallest magnitudes (globally across layers)
4. Rewind the surviving weights to their values in W₀
5. Repeat steps 2-4 until the target sparsity is reached (a minimal sketch of this loop follows the list)
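The sketch below (plain Rust, not aprender's API) runs this loop on a toy weight vector: `train` is a placeholder for a real training loop, and the per-round rate is fixed at 20% for brevity; only the prune-and-rewind bookkeeping mirrors steps 2-4.

```rust
// Minimal IMP sketch: train, prune smallest survivors, rewind to W0, repeat.

fn train(weights: &mut [f32], mask: &[bool]) {
    // Placeholder: a real implementation would run SGD here.
    for (w, &alive) in weights.iter_mut().zip(mask) {
        *w = if alive { *w * 1.1 } else { 0.0 };
    }
}

/// Prune the `rate` fraction of surviving weights with the smallest magnitudes.
fn prune_smallest(weights: &[f32], mask: &mut [bool], rate: f32) {
    let mut alive: Vec<usize> = (0..weights.len()).filter(|&i| mask[i]).collect();
    alive.sort_by(|&a, &b| weights[a].abs().partial_cmp(&weights[b].abs()).unwrap());
    let to_prune = (alive.len() as f32 * rate).round() as usize;
    for &i in alive.iter().take(to_prune) {
        mask[i] = false;
    }
}

fn main() {
    let w0: Vec<f32> = vec![0.3, -0.1, 0.7, 0.05, -0.6, 0.2, 0.9, -0.4, 0.15, -0.8];
    let mut weights = w0.clone();
    let mut mask = vec![true; weights.len()];
    let p = 0.2; // per-round pruning rate (see Mathematical Formulation below)

    for round in 1..=4 {
        train(&mut weights, &mask);             // step 2: train
        prune_smallest(&weights, &mut mask, p); // step 3: prune p% of survivors
        for (w, (&init, &alive)) in weights.iter_mut().zip(w0.iter().zip(&mask)) {
            *w = if alive { init } else { 0.0 }; // step 4: rewind survivors to W0
        }
        let sparsity = mask.iter().filter(|&&m| !m).count() as f32 / mask.len() as f32;
        println!("round {round}: sparsity {:.1}%", sparsity * 100.0);
    }
}
```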
Mathematical Formulation
For target sparsity s after n rounds, the per-round pruning rate is:
p = 1 - (1 - s)^(1/n)
After k rounds, the fraction of weights remaining is:
remaining(k) = (1 - p)^k
For example, with 90% target sparsity over 10 rounds:
- p ≈ 0.206 (20.6% per round)
- remaining(10) = 0.1 (10% of weights remain)
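As a quick sanity check of these numbers, the snippet below (plain Rust, no aprender types) evaluates both formulas for s = 0.9 and n = 10.

```rust
// Checking the numbers above: 90% target sparsity over 10 rounds.
fn main() {
    let (target_sparsity, rounds): (f64, i32) = (0.9, 10);
    let p = 1.0 - (1.0 - target_sparsity).powf(1.0 / rounds as f64);
    let remaining = (1.0 - p).powi(rounds);
    println!("p = {p:.3}");                     // p = 0.206
    println!("remaining(10) = {remaining:.3}"); // remaining(10) = 0.100
}
```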
Rewind Strategies
Different rewinding strategies affect which weights are restored:
Early Rewinding (W₀)
Rewind to the initial random weights. This is the original LTH formulation.
RewindStrategy::Init
Late Rewinding (W_k)
Rewind to the weights from iteration k, a point early in training (rather than all the way to W₀). Often k = 0.1T (10% of training).
RewindStrategy::Early { iteration: 100 }
Reference: Frankle et al. (2019) - "Stabilizing the Lottery Ticket Hypothesis"
Learning Rate Rewinding
Rewind the learning rate schedule to an earlier point in training while keeping the trained (late) weights.
RewindStrategy::Late { fraction: 0.1 } // Rewind the LR schedule to 10% through training
No Rewinding
Keep final trained weights (standard pruning without LTH).
RewindStrategy::None
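To make the four options concrete, the sketch below mirrors the documented variants in a standalone enum and prints which training state each one would restore; it is illustrative only and not aprender's internal implementation.

```rust
#[derive(Debug)]
enum RewindStrategy {
    Init,                       // W0: original LTH formulation
    Early { iteration: usize }, // W_k: rewind to an early training iteration
    Late { fraction: f32 },     // keep trained weights, rewind the LR schedule
    None,                       // standard pruning: keep everything as trained
}

/// Describe which state each strategy restores after pruning, given the
/// total number of training iterations.
fn describe(strategy: &RewindStrategy, total_iters: usize) -> String {
    match strategy {
        RewindStrategy::Init => "restore weights from iteration 0 (W0)".to_string(),
        RewindStrategy::Early { iteration } => {
            format!("restore weights from iteration {iteration} (W_k)")
        }
        RewindStrategy::Late { fraction } => format!(
            "keep trained weights, rewind LR schedule to iteration {}",
            (total_iters as f32 * fraction) as usize
        ),
        RewindStrategy::None => "keep final trained weights and LR".to_string(),
    }
}

fn main() {
    let total_iters = 10_000;
    for s in [
        RewindStrategy::Init,
        RewindStrategy::Early { iteration: 500 },
        RewindStrategy::Late { fraction: 0.1 },
        RewindStrategy::None,
    ] {
        println!("{s:?}: {}", describe(&s, total_iters));
    }
}
```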
Implementation in Aprender
Basic Usage
use aprender::pruning::{LotteryTicketPruner, LotteryTicketConfig, RewindStrategy};
use aprender::nn::Linear;
// Create a model
let model = Linear::new(256, 128);
// Configure: 90% sparsity over 10 rounds, rewind to init
let config = LotteryTicketConfig::new(0.9, 10)
.with_rewind_strategy(RewindStrategy::Init);
let pruner = LotteryTicketPruner::with_config(config);
// Find winning ticket
let ticket = pruner.find_ticket(&model).expect("ticket");
println!("Winning ticket:");
println!(" Sparsity: {:.1}%", ticket.sparsity * 100.0);
println!(" Remaining params: {}", ticket.remaining_parameters);
println!(" Compression: {:.1}x", ticket.compression_ratio());
Builder Pattern
use aprender::pruning::{LotteryTicketPruner, RewindStrategy};
let pruner = LotteryTicketPruner::builder()
.target_sparsity(0.95)
.pruning_rounds(15)
.rewind_strategy(RewindStrategy::Early { iteration: 500 })
.global_pruning(true)
.build();
Applying the Ticket
// Apply winning ticket mask and rewind weights
let result = pruner.apply_ticket(&mut model, &ticket).expect("apply");
// The model now has:
// 1. Sparse weights (90% zeros)
// 2. Non-zero weights reset to initial values
Tracking Sparsity History
let ticket = pruner.find_ticket(&model).expect("ticket");
println!("Sparsity progression:");
for (round, sparsity) in ticket.sparsity_history.iter().enumerate() {
println!(" Round {}: {:.1}%", round + 1, sparsity * 100.0);
}
// Output:
// Round 1: 20.6%
// Round 2: 37.0%
// Round 3: 50.0%
// ...
// Round 10: 90.0%
Why Winning Tickets Work
The Lottery Analogy
Training a neural network is like buying lottery tickets:
- Each random initialization creates a "ticket" (subnetwork structure)
- The winning ticket has fortuitously good initial weights
- Over-parameterization increases the chance of containing a winning ticket
Structural vs. Weight Importance
Winning tickets suggest that:
- Structure matters - The connectivity pattern is crucial
- Initial weights matter - Specific initializations enable training
- Pruning identifies structure - Magnitude pruning discovers winning tickets
Comparison with Standard Pruning
| Aspect | Standard Pruning | Lottery Ticket |
|---|---|---|
| When | After training | Iteratively, across train-prune-rewind cycles |
| Weights | Keep final trained | Rewind to initial |
| Retraining | From pruned state | From initialization |
| Goal | Compress trained model | Find trainable sparse subnet |
Practical Considerations
Computational Cost
LTH requires multiple train-prune-rewind cycles:
Total cost = rounds × training_cost
For 10 rounds, expect ~10x training time.
Hyperparameter Sensitivity
Winning tickets can be sensitive to:
- Learning rate schedule
- Batch size
- Random seed (initialization)
Scaling Challenges
Original LTH is harder to replicate at scale:
- Rewinding to W₀ works well for small networks (MNIST, CIFAR)
- Larger models (ImageNet, BERT) require late rewinding to W_k
Extensions and Variants
One-Shot LTH
Approximate a winning ticket with a single pruning round instead of the full iterative procedure:
// Approximate winning ticket in one shot
let pruner = LotteryTicketPruner::builder()
.target_sparsity(0.9)
.pruning_rounds(1) // Single round
.build();
Supermask Training
Train only the mask, keeping weights frozen:
mask_i ~ Bernoulli(σ(s_i)) // s_i are learned per-weight scores
output = (mask ⊙ W) x // mask applied element-wise to the frozen weights
Reference: Zhou et al. (2019) - "Deconstructing Lottery Tickets"
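A minimal sketch of the supermask idea on a single linear "layer", using plain vectors rather than aprender types: the weights stay frozen, only the scores would be trained, and a deterministic 0.5 threshold stands in for the Bernoulli sampling used by Zhou et al.

```rust
// Supermask forward pass: frozen weights, learned scores, thresholded mask.

fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

fn main() {
    // Frozen random weights of a 1x4 "layer" and learned per-weight scores.
    let w = [0.4_f32, -0.7, 0.1, 0.9];
    let s = [2.0_f32, -1.5, 0.3, 1.0]; // scores (would receive gradients)
    let x = [1.0_f32, 0.5, -1.0, 2.0];

    // Deterministic variant: threshold sigma(s_i) at 0.5.
    let mask: Vec<f32> = s
        .iter()
        .map(|&si| if sigmoid(si) > 0.5 { 1.0 } else { 0.0 })
        .collect();

    // output = (mask ⊙ W) · x
    let output: f32 = w
        .iter()
        .zip(&mask)
        .zip(&x)
        .map(|((wi, mi), xi)| wi * mi * xi)
        .sum();
    println!("mask = {:?}, output = {:.3}", mask, output);
}
```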
Neural Network Pruning at Initialization
Find the mask directly at random initialization, skipping the dense pre-training that IMP requires:
Reference: Lee et al. (2019) - "SNIP: Single-Shot Network Pruning"
Mathematical Properties
Mask Properties
mask ∈ {0, 1}^n // Binary mask
sparsity = |{i : mask_i = 0}| / n // Fraction of pruned (zero) entries
density = 1 - sparsity
Idempotence
Applying a mask multiple times has the same effect:
apply(apply(W, m), m) = apply(W, m)
Compression Ratio
compression = total_params / remaining_params
= 1 / density
For 90% sparsity: compression = 10x
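The snippet below (plain Rust, no aprender API) checks these identities on a toy 10-element mask at 90% sparsity.

```rust
// Sparsity, density, compression, and idempotence on a toy mask.
fn apply(weights: &[f32], mask: &[bool]) -> Vec<f32> {
    weights
        .iter()
        .zip(mask)
        .map(|(&w, &m)| if m { w } else { 0.0 })
        .collect()
}

fn main() {
    let w = vec![0.5_f32, -0.2, 0.0, 0.8, -0.9, 0.1, 0.3, -0.4, 0.6, 0.7];
    let mask = vec![true, false, false, false, false, false, false, false, false, false];

    let n = mask.len() as f32;
    let sparsity = mask.iter().filter(|&&m| !m).count() as f32 / n; // fraction of zeros
    let density = 1.0 - sparsity;
    let compression = 1.0 / density;
    println!(
        "sparsity {:.0}%, density {:.0}%, compression {:.0}x",
        sparsity * 100.0,
        density * 100.0,
        compression
    );

    // Idempotence: applying the mask twice equals applying it once.
    assert_eq!(apply(&apply(&w, &mask), &mask), apply(&w, &mask));
}
```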
Quality Metrics
Winning Ticket Quality
A good winning ticket:
- Achieves target accuracy when retrained
- Matches or exceeds dense network accuracy
- Trains in similar or fewer iterations
Sparsity vs. Accuracy Trade-off
Typical behavior:
- Up to 80% sparsity: minimal accuracy loss
- 80-95% sparsity: gradual degradation
- Above 95% sparsity: significant accuracy drop
References
- Frankle, J., & Carbin, M. (2018). "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks." ICLR 2019.
- Frankle, J., et al. (2019). "Stabilizing the Lottery Ticket Hypothesis." arXiv:1903.01611.
- Zhou, H., et al. (2019). "Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask." NeurIPS 2019.
- Lee, N., et al. (2019). "SNIP: Single-shot Network Pruning based on Connection Sensitivity." ICLR 2019.
- Malach, E., et al. (2020). "Proving the Lottery Ticket Hypothesis: Pruning is All You Need." ICML 2020.