Weak Supervision Theory

Weak supervision uses noisy, limited, or imprecise labels to train models when perfect labels are unavailable or expensive.

The Labeling Bottleneck

Data Type           Scale      Label Cost
Web text            Billions   $0 (unlabeled)
Reviews with stars  Millions   Free (noisy)
Expert annotations  Thousands  $50-500/sample

Weak supervision bridges the gap between unlabeled and perfectly labeled data.

Types of Weak Supervision

1. Incomplete Supervision

Only some samples are labeled:

Dataset: [x₁, x₂, x₃, x₄, x₅, ...]
Labels:  [y₁,  ?,  ?, y₄,  ?, ...]

Approaches: Semi-supervised learning, self-training

2. Inexact Supervision

Labels at coarser granularity:

Document: "The movie was great but too long"
Document label: Positive (even though the second clause is negative)

Approaches: Multiple instance learning, attention
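The multiple instance learning view can be sketched with max-pooling over instance scores: a bag (document) is positive if at least one instance (sentence/clause) is positive. This is one standard MIL assumption, and the probabilities below are illustrative:

```python
import numpy as np

def bag_prediction(instance_probs):
    # MIL max-pooling assumption: the bag is as positive as its most
    # positive instance.
    return float(np.max(instance_probs))

# Two clauses: one strongly positive, one negative -> bag reads positive.
p = bag_prediction([0.9, 0.2])
```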

3. Inaccurate Supervision

Labels contain errors:

True label: Positive
Noisy label: Negative (human error)

Approaches: Noise modeling, co-teaching

Labeling Functions

Programmatic rules that generate noisy labels:

# Labeling functions for sentiment: each votes on a sample or abstains.
# The label constants must be defined (any distinct values work):
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

def lf_positive_words(text):
    if any(word in text for word in ["great", "amazing", "excellent"]):
        return POSITIVE
    return ABSTAIN

def lf_negative_words(text):
    if any(word in text for word in ["terrible", "awful", "bad"]):
        return NEGATIVE
    return ABSTAIN

Properties

Property    Description
Coverage    Fraction of samples where the LF does not abstain
Accuracy    Fraction correct when the LF does not abstain
Overlap     Fraction of samples labeled by two or more LFs
Conflict    Fraction of samples where LFs cast disagreeing labels
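A sketch of how these properties are computed for two labeling functions. The LFs are redefined here so the snippet runs standalone; the toy texts and the POSITIVE/NEGATIVE/ABSTAIN encoding are illustrative assumptions:

```python
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

def lf_positive_words(text):
    if any(w in text for w in ["great", "amazing", "excellent"]):
        return POSITIVE
    return ABSTAIN

def lf_negative_words(text):
    if any(w in text for w in ["terrible", "awful", "bad"]):
        return NEGATIVE
    return ABSTAIN

texts = [
    "great movie",              # only the positive LF fires
    "terrible plot",            # only the negative LF fires
    "great cast but bad plot",  # both fire and disagree -> conflict
    "it was fine",              # neither fires
]
votes = [(lf_positive_words(t), lf_negative_words(t)) for t in texts]

def coverage(col):
    # Fraction of samples on which this LF does not abstain.
    return sum(v != ABSTAIN for v in col) / len(col)

cov_pos = coverage([a for a, _ in votes])
# Overlap: both LFs vote; conflict: both vote and disagree.
overlap = sum(a != ABSTAIN and b != ABSTAIN for a, b in votes) / len(votes)
conflict = sum(a != ABSTAIN and b != ABSTAIN and a != b
               for a, b in votes) / len(votes)
```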

Label Model

Aggregate multiple labeling functions:

  LF₁    LF₂    LF₃    LF₄
    \      \    /      /
     ▼      ▼  ▼      ▼
     Probabilistic Label
             │
             ▼
      True Label (latent)

Data Programming (Snorkel):

  • Model LF accuracies and correlations
  • Infer probabilistic labels
  • Train end model on soft labels

Noise-Aware Training

Forward Correction

Model the noise transition:

P(ỹ|x) = Σᵧ P(ỹ|y) · P(y|x)

where P(ỹ|y) is the noise transition matrix T.
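The forward-corrected loss can be sketched as below: the model's clean posterior is pushed through T and scored against the noisy label. The noise rates in T are illustrative assumptions:

```python
import numpy as np

def forward_corrected_nll(p_clean, T, noisy_label):
    # p_clean: model estimate of P(y|x); T[i, j] = P(noisy=j | true=i).
    # Training against P(noisy|x) = p_clean @ T lets the model keep a
    # clean posterior even though targets are noisy.
    p_noisy = p_clean @ T
    return float(-np.log(p_noisy[noisy_label]))

# Assumed noise rates: 10% of class-0 and 20% of class-1 labels flipped.
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])
loss = forward_corrected_nll(np.array([0.7, 0.3]), T, noisy_label=1)
```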

Backward Correction

Weight loss by estimated noise:

L = Σᵢ wᵢ · loss(fθ(xᵢ), ỹᵢ)

where wᵢ reflects the estimated confidence in label ỹᵢ.
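A minimal sketch of the weighted loss above; the confidence weights would in practice come from a label model or a quality estimator, and the values here are illustrative:

```python
import numpy as np

def confidence_weighted_loss(per_sample_loss, confidence):
    # Down-weight samples whose labels are less trusted, then normalize
    # by the total weight so the scale stays comparable.
    w = np.asarray(confidence, dtype=float)
    return float(np.sum(w * per_sample_loss) / np.sum(w))

# Second sample's label is only half-trusted, so its loss counts half.
loss = confidence_weighted_loss([1.0, 4.0], [1.0, 0.5])
```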

Co-Teaching

Two networks teach each other:

Network A → Select small-loss samples → Train Network B
Network B → Select small-loss samples → Train Network A

Exploits the memorization effect: deep networks fit clean samples earlier than noisy ones, so small-loss samples are more likely to be clean.
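The core selection step can be sketched as follows; the losses and keep ratio are illustrative, and in the full method the ratio is annealed over training:

```python
import numpy as np

def small_loss_selection(losses, keep_ratio):
    # Keep the keep_ratio fraction of samples with the smallest loss;
    # in co-teaching, each network trains on the samples its peer
    # selected this way.
    k = int(len(losses) * keep_ratio)
    return np.argsort(losses)[:k]

# Per-sample losses computed by network A on a mini-batch.
losses_from_net_a = np.array([0.2, 2.5, 0.1, 1.8, 0.3])
idx_for_net_b = small_loss_selection(losses_from_net_a, keep_ratio=0.6)
```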

Semi-Supervised Learning

Exploit abundant unlabeled data alongside a small labeled set:

Self-Training

1. Train on labeled data
2. Predict on unlabeled data
3. Add confident predictions to training set
4. Repeat
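The loop above can be sketched as below. The fit/predict_proba interface is an assumed scikit-learn-style convention, the 0.95 threshold is an illustrative choice, and the tiny nearest-neighbor model exists only so the snippet runs standalone:

```python
def self_train(model, X_lab, y_lab, X_unlab, threshold=0.95, rounds=5):
    # Iteratively train, pseudo-label confident unlabeled points, repeat.
    X_lab, y_lab = list(X_lab), list(y_lab)
    pool = list(X_unlab)
    for _ in range(rounds):
        model.fit(X_lab, y_lab)
        if not pool:
            break
        keep = []
        for x in pool:
            probs = model.predict_proba([x])[0]
            c = max(range(len(probs)), key=lambda i: probs[i])
            if probs[c] >= threshold:
                X_lab.append(x)   # promote confident prediction
                y_lab.append(c)   # to a pseudo-label
            else:
                keep.append(x)
        pool = keep
    return model

class _NearestLabelModel:
    # Tiny 1-NN stand-in so the loop is runnable without sklearn.
    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
    def predict_proba(self, X):
        out = []
        for x in X:
            j = min(range(len(self.X)), key=lambda i: abs(self.X[i] - x))
            probs = [0.0, 0.0]
            probs[self.y[j]] = 1.0
            out.append(probs)
        return out

m = self_train(_NearestLabelModel(), X_lab=[0.0, 10.0], y_lab=[0, 1],
               X_unlab=[1.0, 9.0, 2.0, 8.0])
```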

Consistency Regularization

L = L_supervised + λ · ||f(x) - f(aug(x))||²

Predictions should be consistent under augmentation.
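The consistency term can be sketched numerically. The linear "model" and additive-noise "augmentation" below are stand-ins; in practice f is a network and aug is e.g. a random crop or flip:

```python
import numpy as np

def consistency_loss(f, x, aug, lam=1.0):
    # lam * ||f(x) - f(aug(x))||^2, the unsupervised term of the loss.
    d = f(x) - f(aug(x))
    return lam * float(np.sum(d * d))

f = lambda x: 2.0 * x      # stand-in model
aug = lambda x: x + 0.1    # stand-in augmentation
loss = consistency_loss(f, np.array([1.0, -1.0]), aug)
```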

MixMatch / FixMatch

Combine:

  • Pseudo-labeling
  • Consistency regularization
  • Data augmentation

Crowdsourcing

Aggregate labels from multiple annotators:

Majority Vote

ŷ = mode(y₁, y₂, ..., yₙ)

Simple but ignores annotator quality.
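A one-liner sketch of the mode; note every annotator counts equally, which is exactly the weakness the Dawid-Skene model below addresses:

```python
from collections import Counter

def majority_vote(labels):
    # Most common annotator label; ties break by first-seen order.
    return Counter(labels).most_common(1)[0][0]

yhat = majority_vote(["pos", "neg", "pos"])
```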

Dawid-Skene Model

Model annotator reliability:

P(yⱼ|y*) = confusion matrix for annotator j

EM algorithm estimates true labels and annotator accuracies.
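A compact EM sketch of the Dawid-Skene model. For brevity it assumes every annotator labels every item (the full model handles missing votes), and the vote matrix in the demo is illustrative, with annotator 2 deliberately unreliable:

```python
import numpy as np

def dawid_skene(votes, n_classes, n_iter=20):
    # votes[i, j] = label annotator j gave item i.
    votes = np.asarray(votes)
    n_items, n_annot = votes.shape
    # Initialize posteriors from per-item vote frequencies (soft majority).
    post = np.zeros((n_items, n_classes))
    for i in range(n_items):
        for j in range(n_annot):
            post[i, votes[i, j]] += 1
    post /= post.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # M-step: class prior and one confusion matrix per annotator.
        prior = post.mean(axis=0)
        conf = np.full((n_annot, n_classes, n_classes), 1e-6)  # smoothing
        for j in range(n_annot):
            for i in range(n_items):
                conf[j, :, votes[i, j]] += post[i]
            conf[j] /= conf[j].sum(axis=1, keepdims=True)
        # E-step: posterior over true labels given all votes.
        log_post = np.tile(np.log(prior), (n_items, 1))
        for i in range(n_items):
            for j in range(n_annot):
                log_post[i] += np.log(conf[j, :, votes[i, j]])
        post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
    return post

# Annotators 0 and 1 agree; annotator 2 often disagrees.
post = dawid_skene([[0, 0, 1], [0, 0, 0], [1, 1, 1], [1, 1, 0]],
                   n_classes=2)
```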

Quality Estimation

Label Quality Score

Score(x, ỹ) = P(y* = ỹ | x, model)

Low scores indicate potential label errors.

Confident Learning

  1. Estimate joint P(y*, ỹ)
  2. Identify samples where y* ≠ ỹ
  3. Prune, re-weight, or correct
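A simplified sketch in the confident-learning spirit: score each label by the model's self-confidence, then flag samples whose score falls below the per-class average and whose predicted class disagrees with the given label. The full method estimates the joint P(y*, ỹ) with calibrated out-of-sample probabilities; the predictions below are illustrative:

```python
import numpy as np

def label_quality(pred_probs, noisy_labels):
    # Self-confidence: model probability assigned to the given label.
    return pred_probs[np.arange(len(noisy_labels)), noisy_labels]

def flag_issues(pred_probs, noisy_labels):
    scores = label_quality(pred_probs, noisy_labels)
    flagged = []
    for c in np.unique(noisy_labels):
        # Per-class threshold: average self-confidence of class c labels.
        thresh = scores[noisy_labels == c].mean()
        for i in np.where(noisy_labels == c)[0]:
            if scores[i] < thresh and pred_probs[i].argmax() != c:
                flagged.append(int(i))
    return sorted(flagged)

pred_probs = np.array([[0.9, 0.1],
                       [0.8, 0.2],
                       [0.2, 0.8],   # model disagrees with its label
                       [0.85, 0.15]])
noisy_labels = np.array([0, 0, 0, 0])
issues = flag_issues(pred_probs, noisy_labels)
```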

References

  • Ratner, A., et al. (2017). "Snorkel: Rapid Training Data Creation with Weak Supervision." VLDB.
  • Han, B., et al. (2018). "Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels." NeurIPS.
  • Northcutt, C., et al. (2021). "Confident Learning: Estimating Uncertainty in Dataset Labels." JAIR.