Weak Supervision Theory
Weak supervision uses noisy, limited, or imprecise labels to train models when perfect labels are unavailable or expensive.
The Labeling Bottleneck
| Data Type | Scale | Label Cost |
|---|---|---|
| Web text | Billions | $0 (unlabeled) |
| Reviews with stars | Millions | Free (noisy) |
| Expert annotations | Thousands | $50-500/sample |
Weak supervision bridges the gap between unlabeled and perfectly labeled data.
Types of Weak Supervision
1. Incomplete Supervision
Only some samples are labeled:
Dataset: [x₁, x₂, x₃, x₄, x₅, ...]
Labels: [y₁, ?, ?, y₄, ?, ...]
Approaches: Semi-supervised learning, self-training
2. Inexact Supervision
Labels at coarser granularity:
Document: "The movie was great but too long"
Document label: Positive (but sentence 2 is negative)
Approaches: Multiple instance learning, attention
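A minimal sketch of the multiple-instance idea, using made-up sentence scores for the example document above: score each clause with some instance-level model and pool the scores, so only the coarse document label is ever needed.

import numpy as np

# Hypothetical clause scores from a sentence-level sentiment model
# (higher = more positive); only the document label is observed.
clause_scores = np.array([0.9, 0.2])     # "was great" vs. "too long"

# Multiple-instance pooling: the document is positive if its most
# positive clause is positive (max pooling over instances).
doc_score = clause_scores.max()
doc_pred = int(doc_score > 0.5)          # 1 = Positive, 0 = Negative
print(doc_pred)                          # -> 1, matching the document label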
3. Inaccurate Supervision
Labels contain errors:
True label: Positive
Noisy label: Negative (human error)
Approaches: Noise modeling, co-teaching
Labeling Functions
Programmatic rules that generate noisy labels:
# Labeling functions for sentiment: each one votes POSITIVE, NEGATIVE, or ABSTAIN.
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

def lf_positive_words(text):
    # Vote POSITIVE if any strong positive keyword appears.
    if any(word in text.lower() for word in ["great", "amazing", "excellent"]):
        return POSITIVE
    return ABSTAIN

def lf_negative_words(text):
    # Vote NEGATIVE if any strong negative keyword appears.
    if any(word in text.lower() for word in ["terrible", "awful", "bad"]):
        return NEGATIVE
    return ABSTAIN
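A small usage sketch, continuing the functions and constants above with a few made-up documents: apply every labeling function to every document to build an n × m label matrix L, which is the input to the label model described later.

docs = [
    "An amazing film, truly great",
    "Terrible plot and awful acting",
    "It was fine, nothing special",
]
lfs = [lf_positive_words, lf_negative_words]

# L[i][j] = vote of labeling function j on document i (or ABSTAIN).
L = [[lf(doc) for lf in lfs] for doc in docs]
print(L)   # [[1, -1], [-1, 0], [-1, -1]]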
Properties
| Property | Description |
|---|---|
| Coverage | Fraction of samples labeled |
| Accuracy | Correctness when not abstaining |
| Overlap | Agreement between LFs |
| Conflict | Disagreement between LFs |
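Coverage, overlap, and conflict can be read directly off a label matrix; accuracy additionally needs a small labeled development set (or the label model's own estimate). A hedged sketch on a hand-written matrix, using the same ABSTAIN = -1 convention as above:

import numpy as np

ABSTAIN = -1
# Rows = samples, columns = labeling functions; entries are votes or ABSTAIN.
L = np.array([
    [1, -1,  1],
    [0,  0, -1],
    [1,  0,  1],
    [-1, -1, -1],
])

votes = L != ABSTAIN
coverage = votes.mean(axis=0)            # fraction of samples each LF labels
print("coverage:", coverage)

# Pairwise overlap: both LFs vote; conflict: both vote and disagree.
n_lfs = L.shape[1]
for i in range(n_lfs):
    for j in range(i + 1, n_lfs):
        both = votes[:, i] & votes[:, j]
        overlap = both.mean()
        conflict = (both & (L[:, i] != L[:, j])).mean()
        print(f"LF{i} vs LF{j}: overlap={overlap:.2f}, conflict={conflict:.2f}")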
Label Model
Aggregate multiple labeling functions:
LF₁   LF₂   LF₃   LF₄
  \    |     |    /
   ▼   ▼     ▼   ▼
  Probabilistic Label
          │
          ▼
  True Label (latent)
Data Programming (Snorkel):
- Model LF accuracies and correlations
- Infer probabilistic labels
- Train end model on soft labels
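A hedged, hand-rolled stand-in for a label model (Snorkel's actual model also learns LF correlations and accuracies from the label matrix itself): assume each LF's accuracy is known, treat LFs as conditionally independent given the true label, and combine their votes into a probabilistic label with a naive-Bayes-style weighted vote. The accuracies below are made-up numbers.

import numpy as np

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def probabilistic_label(votes, accuracies, prior=0.5):
    """Naive-Bayes combination of LF votes into P(y = POSITIVE)."""
    log_odds = np.log(prior / (1 - prior))
    for vote, acc in zip(votes, accuracies):
        if vote == ABSTAIN:
            continue
        # An LF votes with the true label with probability `acc`,
        # and against it with probability (1 - acc).
        if vote == POSITIVE:
            log_odds += np.log(acc / (1 - acc))
        else:
            log_odds -= np.log(acc / (1 - acc))
    return 1 / (1 + np.exp(-log_odds))

# Assumed accuracies for three hypothetical LFs.
accuracies = [0.8, 0.7, 0.9]
print(probabilistic_label([POSITIVE, ABSTAIN, POSITIVE], accuracies))   # ≈ 0.97
print(probabilistic_label([POSITIVE, NEGATIVE, ABSTAIN], accuracies))   # ≈ 0.63

The resulting soft labels are then used as targets (e.g., cross-entropy against probabilities) when training the end model.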
Noise-Aware Training
Forward Correction
Model the noise transition:
P(ỹ|x) = Σᵧ P(ỹ|y) · P(y|x)

where P(ỹ|y) is an entry of the noise transition matrix T.
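A minimal numeric sketch of forward correction, with an assumed (known) noise matrix T; in practice T must be estimated. The model's clean-label probabilities are pushed through T before computing the loss against the noisy label.

import numpy as np

# T[i, j] = P(noisy label j | true label i); assumed known for this sketch.
T = np.array([
    [0.9, 0.1],   # true NEGATIVE flips to POSITIVE 10% of the time
    [0.2, 0.8],   # true POSITIVE flips to NEGATIVE 20% of the time
])

p_clean = np.array([0.3, 0.7])        # model's P(y|x) over (NEGATIVE, POSITIVE)
p_noisy = p_clean @ T                 # P(ỹ|x) = Σᵧ P(ỹ|y) · P(y|x)
noisy_label = 1                       # observed (possibly wrong) label

loss = -np.log(p_noisy[noisy_label])  # cross-entropy against the noisy label
print(p_noisy, loss)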
Backward Correction
Weight the loss by the estimated label reliability:
L = Σᵢ wᵢ · loss(fθ(xᵢ), ỹᵢ)
where wᵢ reflects the confidence that ỹᵢ is the correct label. (Backward correction in the strict sense multiplies the per-class loss vector by the inverse noise matrix T⁻¹; the re-weighting above is a common simplification.)
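A short sketch of the re-weighted form; the weights here are illustrative and would typically come from a label model's probabilistic labels or a quality score.

import numpy as np

per_sample_loss = np.array([0.2, 1.5, 0.7])   # loss(fθ(xᵢ), ỹᵢ) for three samples
confidence = np.array([0.95, 0.4, 0.8])       # estimated P(ỹᵢ is correct)

weighted_loss = (confidence * per_sample_loss).sum()
print(weighted_loss)   # the sample with an unreliable label contributes less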
Co-Teaching
Two networks teach each other:
Network A → Select small-loss samples → Train Network B
Network B → Select small-loss samples → Train Network A
Exploits the observation that deep networks fit clean samples before memorizing noisy ones, so small-loss samples are more likely to have correct labels.
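A condensed PyTorch-style sketch of one co-teaching update, assuming the models, optimizers, and forget rate are supplied by the caller: each network ranks the batch by its own per-sample loss and hands its small-loss subset to the peer for the gradient step.

import torch
import torch.nn.functional as F

def coteach_step(model_a, model_b, opt_a, opt_b, x, y_noisy, forget_rate=0.2):
    # Per-sample losses under each network (no gradients needed for ranking).
    with torch.no_grad():
        loss_a = F.cross_entropy(model_a(x), y_noisy, reduction="none")
        loss_b = F.cross_entropy(model_b(x), y_noisy, reduction="none")

    keep = int((1 - forget_rate) * len(y_noisy))
    idx_a = torch.argsort(loss_a)[:keep]   # A's small-loss picks will train B
    idx_b = torch.argsort(loss_b)[:keep]   # B's small-loss picks will train A

    opt_a.zero_grad()
    F.cross_entropy(model_a(x[idx_b]), y_noisy[idx_b]).backward()
    opt_a.step()

    opt_b.zero_grad()
    F.cross_entropy(model_b(x[idx_a]), y_noisy[idx_a]).backward()
    opt_b.step()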
Semi-Supervised Learning
Use abundant unlabeled data together with a small labeled set:
Self-Training
1. Train on labeled data
2. Predict on unlabeled data
3. Add confident predictions to training set
4. Repeat
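A runnable sketch of this loop using scikit-learn; the toy two-cluster data, the 0.95 confidence threshold, and the number of rounds are arbitrary choices.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) + np.repeat([[0, 0], [3, 3]], 100, axis=0)
y = np.repeat([0, 1], 100)

labeled = np.concatenate([np.arange(5), np.arange(100, 105)])   # only 10 labels known
unlabeled = np.setdiff1d(np.arange(200), labeled)

X_lab, y_lab = X[labeled], y[labeled]
for _ in range(5):                                       # a few self-training rounds
    clf = LogisticRegression().fit(X_lab, y_lab)         # 1. train on labeled data
    if len(unlabeled) == 0:
        break
    proba = clf.predict_proba(X[unlabeled])              # 2. predict on unlabeled data
    confident = proba.max(axis=1) > 0.95                 # 3. keep confident predictions
    if not confident.any():
        break
    X_lab = np.vstack([X_lab, X[unlabeled][confident]])
    y_lab = np.concatenate([y_lab, proba[confident].argmax(axis=1)])
    unlabeled = unlabeled[~confident]                    # 4. repeat on the rest

print(len(y_lab), "training labels after self-training")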
Consistency Regularization
L = L_supervised + λ · ||f(x) - f(aug(x))||²
Predictions should be consistent under augmentation.
MixMatch / FixMatch
Combine:
- Pseudo-labeling
- Consistency regularization
- Data augmentation
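A FixMatch-flavoured sketch of the combined objective; the augmentation functions, confidence threshold, and loss weight are placeholders supplied by the caller, and a real implementation uses strong domain-specific augmentations.

import torch
import torch.nn.functional as F

def fixmatch_loss(model, x_lab, y_lab, x_unlab, weak_aug, strong_aug,
                  threshold=0.95, lam=1.0):
    # Supervised term on the few labeled samples.
    sup = F.cross_entropy(model(weak_aug(x_lab)), y_lab)

    # Pseudo-label from the weakly augmented view, without gradients.
    with torch.no_grad():
        probs = F.softmax(model(weak_aug(x_unlab)), dim=-1)
        conf, pseudo = probs.max(dim=-1)
        mask = (conf >= threshold).float()      # keep only confident pseudo-labels

    # Consistency: the strongly augmented view must match the pseudo-label.
    unsup = F.cross_entropy(model(strong_aug(x_unlab)), pseudo, reduction="none")
    unsup = (mask * unsup).mean()

    return sup + lam * unsup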
Crowdsourcing
Aggregate labels from multiple annotators:
Majority Vote
ŷ = mode(y₁, y₂, ..., yₙ)
Simple but ignores annotator quality.
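A two-line sketch with toy annotator votes:

from collections import Counter

votes = [1, 1, 0, 1, 0]                      # labels from five annotators
y_hat = Counter(votes).most_common(1)[0][0]  # -> 1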
Dawid-Skene Model
Model annotator reliability:
P(yⱼ|y*) = confusion matrix for annotator j
EM algorithm estimates true labels and annotator accuracies.
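A compact EM sketch of the Dawid-Skene idea for binary labels, initialised from majority vote; the vote matrix is made up, and a full implementation also handles missing annotations, smoothing, and more than two classes.

import numpy as np

# votes[i, j] = label given by annotator j to item i (binary, fully observed here).
votes = np.array([
    [1, 1, 0],
    [0, 0, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
])

# Initialise the posterior P(y*_i = 1) with the majority vote.
q = votes.mean(axis=1)

for _ in range(20):
    # M-step: per-annotator reliabilities P(vote = 1 | y* = 1) and P(vote = 1 | y* = 0).
    p1 = (q[:, None] * votes).sum(axis=0) / q.sum()
    p0 = ((1 - q)[:, None] * votes).sum(axis=0) / (1 - q).sum()
    prior = q.mean()

    # E-step: recompute P(y*_i = 1 | votes) assuming annotators are independent.
    like1 = prior * np.prod(np.where(votes == 1, p1, 1 - p1), axis=1)
    like0 = (1 - prior) * np.prod(np.where(votes == 1, p0, 1 - p0), axis=1)
    q = like1 / (like1 + like0)

print(np.round(q, 2))   # posterior that each item's true label is 1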
Quality Estimation
Label Quality Score
Score(x, ỹ) = P(y* = ỹ | x, model)
Low scores indicate potential label errors.
Confident Learning
- Estimate joint P(y*, ỹ)
- Identify samples where y* ≠ ỹ
- Prune, re-weight, or correct
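A stripped-down sketch of the counting step behind confident learning (the cleanlab library implements the full method; the probabilities and labels below are made up): use per-class confidence thresholds to decide which class each sample confidently belongs to, then flag samples whose given label disagrees.

import numpy as np

# Out-of-sample predicted probabilities and the given (possibly noisy) labels.
pred_probs = np.array([
    [0.90, 0.10],
    [0.20, 0.80],
    [0.85, 0.15],
    [0.10, 0.90],
    [0.60, 0.40],
])
given = np.array([0, 1, 1, 1, 0])

# Per-class threshold: average self-confidence of samples given that class.
thresholds = np.array([pred_probs[given == c, c].mean() for c in range(2)])

# A sample "confidently" belongs to class c if its probability exceeds t_c.
confident_class = np.full(len(given), -1)
for c in range(2):
    confident_class[pred_probs[:, c] >= thresholds[c]] = c

suspect = np.where((confident_class != -1) & (confident_class != given))[0]
print(suspect)   # -> [2]: its given label disagrees with its confident class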
References
- Ratner, A., et al. (2017). "Snorkel: Rapid Training Data Creation with Weak Supervision." VLDB.
- Han, B., et al. (2018). "Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels." NeurIPS.
- Northcutt, C., et al. (2021). "Confident Learning: Estimating Uncertainty in Dataset Labels." JAIR.