Case Study: XOR Neural Network

The XOR problem is the "Hello World" of deep learning - a classic benchmark showing that a neural network can learn a non-linearly separable function through backpropagation.

Why XOR Matters

XOR (exclusive or) is not linearly separable. No single straight line can separate the classes:

    X2
    │
  1 │  ●(0,1)=1     ○(1,1)=0
    │
    ├───────────────────── X1
    │
  0 │  ○(0,0)=0     ●(1,0)=1
    │
        0           1

This means:

  • Single-layer perceptrons fail: no linear decision boundary can separate the classes (see the sketch below)
  • A hidden layer is required to create a non-linear decision boundary
  • Successfully learning XOR confirms that backpropagation through hidden layers is working
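
As an illustration (a standalone sketch in plain Rust, independent of the aprender example), a brute-force search over a coarse grid of weights and biases never finds a single linear threshold that gets all four points right:

// Brute-force check: no single line w1*x1 + w2*x2 + b = 0 separates XOR.
fn main() {
    let points = [(0.0, 0.0, 0.0), (0.0, 1.0, 1.0), (1.0, 0.0, 1.0), (1.0, 1.0, 0.0)];
    let grid: Vec<f64> = (-20..=20).map(|i| i as f64 * 0.5).collect();
    let mut best = 0;
    for &w1 in &grid {
        for &w2 in &grid {
            for &b in &grid {
                // Count how many XOR points this linear classifier gets right.
                let correct = points
                    .iter()
                    .filter(|&&(x1, x2, y)| {
                        let pred = if w1 * x1 + w2 * x2 + b > 0.0 { 1.0 } else { 0.0 };
                        pred == y
                    })
                    .count();
                best = best.max(correct);
            }
        }
    }
    println!("Best single-line accuracy: {}/4", best); // never reaches 4/4
}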

The Mathematics

Truth Table

X1   X2   XOR Output
────────────────────
0    0    0
0    1    1
1    0    1
1    1    0

Network Architecture

Input(2) → Linear(2→8) → ReLU → Linear(8→1) → Sigmoid
  • Input layer: 2 features (X1, X2)
  • Hidden layer: 8 neurons with ReLU activation
  • Output layer: 1 neuron with Sigmoid (outputs probability)

Total parameters: 2×8 + 8 + 8×1 + 1 = 33
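
The arithmetic can be double-checked with a tiny standalone snippet (plain Rust, not using the library):

fn main() {
    let hidden = 8;
    let layer1 = 2 * hidden + hidden; // Linear(2→8): 16 weights + 8 biases = 24
    let layer2 = hidden * 1 + 1;      // Linear(8→1): 8 weights + 1 bias   = 9
    assert_eq!(layer1 + layer2, 33);
}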

Implementation

use aprender::autograd::{clear_graph, Tensor};
use aprender::nn::{
    loss::MSELoss, optim::SGD, Linear, Module, Optimizer,
    ReLU, Sequential, Sigmoid,
};

fn main() {
    // XOR dataset
    let x = Tensor::new(&[
        0.0, 0.0,  // → 0
        0.0, 1.0,  // → 1
        1.0, 0.0,  // → 1
        1.0, 1.0,  // → 0
    ], &[4, 2]);

    let y = Tensor::new(&[0.0, 1.0, 1.0, 0.0], &[4, 1]);

    // Build network
    let mut model = Sequential::new()
        .add(Linear::with_seed(2, 8, Some(42)))
        .add(ReLU::new())
        .add(Linear::with_seed(8, 1, Some(43)))
        .add(Sigmoid::new());

    // Setup training
    let mut optimizer = SGD::new(model.parameters_mut(), 0.5);
    let loss_fn = MSELoss::new();

    // Training loop
    for epoch in 0..1000 {
        clear_graph();

        // Forward pass
        let x_grad = x.clone().requires_grad();
        let output = model.forward(&x_grad);

        // Compute loss
        let loss = loss_fn.forward(&output, &y);

        // Backward pass
        loss.backward();

        // Update weights
        let mut params = model.parameters_mut();
        optimizer.step_with_params(&mut params);
        optimizer.zero_grad();

        if epoch % 100 == 0 {
            println!("Epoch {}: Loss = {:.6}", epoch, loss.item());
        }
    }

    // Evaluate
    let final_output = model.forward(&x);
    println!("Predictions: {:?}", final_output.data());
}

Training Dynamics

Loss Curve

Epoch     Loss        Accuracy
─────────────────────────────
    0     0.304618      50%
  100     0.081109     100%
  200     0.013253     100%
  300     0.005368     100%
  500     0.002103     100%
 1000     0.000725     100%

The network:

  1. Starts random (50% accuracy = random guessing)
  2. Learns quickly (100% by epoch 100)
  3. Refines confidence (loss continues decreasing)

Final Predictions

Input    Target   Prediction   Confidence
──────────────────────────────────────────
(0,0)    0        0.034        96.6%
(0,1)    1        0.977        97.7%
(1,0)    1        0.974        97.4%
(1,1)    0        0.023        97.7%

Key Concepts Demonstrated

1. Automatic Differentiation

loss.backward();  // Computes ∂L/∂w for all weights

The autograd engine:

  • Records operations during forward pass
  • Computes gradients in reverse (backpropagation)
  • Applies the chain rule automatically (illustrated by hand below)
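
To make the chain rule concrete, here is the gradient of a single sigmoid neuron computed by hand (a plain-Rust sketch of what backward() automates; the values are illustrative only):

// Manual chain rule for one neuron: L = (sigmoid(w*x) - y)^2
fn main() {
    let (x, y, w) = (1.0_f64, 0.0_f64, 0.8_f64);

    // Forward pass: record intermediate values.
    let z = w * x;
    let p = 1.0 / (1.0 + (-z).exp()); // sigmoid(z)
    let loss = (p - y).powi(2);

    // Backward pass: multiply local derivatives in reverse order.
    let dloss_dp = 2.0 * (p - y);            // ∂L/∂p
    let dp_dz = p * (1.0 - p);               // ∂sigmoid/∂z
    let dz_dw = x;                           // ∂z/∂w
    let dloss_dw = dloss_dp * dp_dz * dz_dw; // ∂L/∂w

    println!("loss = {loss:.4}, dL/dw = {dloss_dw:.4}");
}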

2. Non-Linear Activation

.add(ReLU::new())  // f(x) = max(0, x)

ReLU enables the network to learn non-linear decision boundaries. Without it, stacking linear layers would still be linear.
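
A quick numeric check of that claim (plain Rust, unrelated to the library): two stacked 1-D linear maps collapse into a single linear map, while inserting ReLU between them does not:

fn relu(x: f64) -> f64 { x.max(0.0) }

fn main() {
    // Two 1-D "layers": y = a2 * (a1 * x + b1) + b2
    let (a1, b1, a2, b2) = (2.0, 1.0, -3.0, 0.5);
    let x = -1.0;

    let stacked = a2 * (a1 * x + b1) + b2;
    // The same map as one linear layer: slope a2*a1, intercept a2*b1 + b2.
    let collapsed = (a2 * a1) * x + (a2 * b1 + b2);
    assert!((stacked - collapsed).abs() < 1e-12);

    // With ReLU in between, the composition is no longer linear in x.
    let with_relu = a2 * relu(a1 * x + b1) + b2;
    println!("linear-only: {stacked}, with ReLU: {with_relu}"); // 3.5 vs 0.5
}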

3. Gradient Descent

optimizer.step_with_params(&mut params);

Updates weights: w = w - lr × ∂L/∂w

With learning rate 0.5, the network converges in ~100 epochs.
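
The update itself is a one-liner; here is a plain-Rust sketch for a single scalar weight (values chosen only for illustration):

fn main() {
    let lr = 0.5;      // learning rate
    let mut w = 0.2;   // one weight
    let grad = -0.8;   // ∂L/∂w from the backward pass

    // Gradient descent step: w = w - lr * ∂L/∂w
    w -= lr * grad;
    println!("updated weight: {w}"); // 0.2 - 0.5 * (-0.8) = 0.6
}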

Running the Example

cargo run --example xor_training
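
The printed losses should roughly follow the loss curve above (exact values may vary with the library version), for example:

Epoch 0: Loss = 0.304618
Epoch 100: Loss = 0.081109
Epoch 200: Loss = 0.013253
...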

Exercises

  1. Change hidden size: Try 4 or 16 neurons instead of 8
  2. Change learning rate: What happens with lr=0.1 or lr=1.0?
  3. Use Adam optimizer: Replace SGD with Adam
  4. Add another hidden layer: Does it help or hurt?
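
For exercise 4, a deeper model can be built with the same Sequential API used above; this sketch simply replaces the model construction in the example (the extra seed value is arbitrary):

let mut model = Sequential::new()
    .add(Linear::with_seed(2, 8, Some(42)))
    .add(ReLU::new())
    .add(Linear::with_seed(8, 8, Some(43)))
    .add(ReLU::new())
    .add(Linear::with_seed(8, 1, Some(44)))
    .add(Sigmoid::new());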

Common Issues

Problem               Cause                     Solution
──────────────────────────────────────────────────────────────
Loss stuck at ~0.25   Vanishing gradients       Increase learning rate
Loss oscillates       Learning rate too high    Decrease learning rate
50% accuracy          Not learning              Check gradient flow

Theory: Universal Approximation

The XOR example demonstrates the Universal Approximation Theorem in miniature: a neural network with one hidden layer can approximate any continuous function on a bounded domain to arbitrary accuracy, given enough neurons.

XOR requires learning a function like:

f(x1, x2) ≈ x1(1-x2) + x2(1-x1)

The hidden layer learns intermediate features that make the problem linearly separable for the output layer, as the hand-crafted example below shows.
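
As a concrete (hand-crafted, not learned) illustration, two ReLU hidden units already suffice to compute XOR exactly on the four corner points; the trained network's 8-unit hidden layer will find different weights:

fn relu(x: f64) -> f64 { x.max(0.0) }

// Hand-picked hidden features: h1 = ReLU(x1 + x2), h2 = ReLU(x1 + x2 - 1).
// The output layer then reads off XOR as h1 - 2*h2.
fn xor(x1: f64, x2: f64) -> f64 {
    let h1 = relu(x1 + x2);
    let h2 = relu(x1 + x2 - 1.0);
    h1 - 2.0 * h2
}

fn main() {
    for &(x1, x2) in &[(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)] {
        println!("XOR({x1}, {x2}) = {}", xor(x1, x2)); // 0, 1, 1, 0
    }
}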

Next Steps