Case Study: XOR Neural Network

The XOR problem is the "Hello World" of deep learning - a classic benchmark showing that a neural network can learn a non-linearly separable function through backpropagation.

Why XOR Matters

XOR (exclusive or) is not linearly separable. No single straight line can separate the classes:

    X2
    │
  1 │  ●(0,1)=1     ○(1,1)=0
    │
    ├───────────────────── X1
    │
  0 │  ○(0,0)=0     ●(1,0)=1
    │
        0           1

This means:

  • Single-layer perceptrons fail: no linear decision boundary can separate the classes (see the sketch below)
  • A hidden layer is required to create a non-linear decision boundary
  • Successfully learning XOR confirms that backpropagation through hidden layers is working
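
As an illustration (a standalone sketch in plain Rust, independent of the aprender example), a brute-force search over a coarse grid of weights and biases never finds a single linear threshold that gets all four points right:

// Brute-force check: no single line w1*x1 + w2*x2 + b = 0 separates XOR.
fn main() {
    let points = [(0.0, 0.0, 0.0), (0.0, 1.0, 1.0), (1.0, 0.0, 1.0), (1.0, 1.0, 0.0)];
    let grid: Vec<f64> = (-20..=20).map(|i| i as f64 * 0.5).collect();
    let mut best = 0;
    for &w1 in &grid {
        for &w2 in &grid {
            for &b in &grid {
                // Count how many XOR points this linear classifier gets right.
                let correct = points
                    .iter()
                    .filter(|&&(x1, x2, y)| {
                        let pred = if w1 * x1 + w2 * x2 + b > 0.0 { 1.0 } else { 0.0 };
                        pred == y
                    })
                    .count();
                best = best.max(correct);
            }
        }
    }
    println!("Best single-line accuracy: {}/4", best); // never reaches 4/4
}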

The Mathematics

Truth Table

X1   X2   XOR Output
────────────────────
0    0    0
0    1    1
1    0    1
1    1    0

Network Architecture

Input(2) → Linear(2→8) → ReLU → Linear(8→1) → Sigmoid
  • Input layer: 2 features (X1, X2)
  • Hidden layer: 8 neurons with ReLU activation
  • Output layer: 1 neuron with Sigmoid (outputs probability)

Total parameters: 2×8 + 8 + 8×1 + 1 = 33
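
The arithmetic can be double-checked with a tiny standalone snippet (plain Rust, not using the library):

fn main() {
    let hidden = 8;
    let layer1 = 2 * hidden + hidden; // Linear(2→8): 16 weights + 8 biases = 24
    let layer2 = hidden * 1 + 1;      // Linear(8→1): 8 weights + 1 bias   = 9
    assert_eq!(layer1 + layer2, 33);
}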

Implementation

use aprender::autograd::{clear_graph, Tensor};
use aprender::nn::{
    loss::MSELoss, optim::SGD, Linear, Module, Optimizer,
    ReLU, Sequential, Sigmoid,
};

fn main() {
    // XOR dataset
    let x = Tensor::new(&[
        0.0, 0.0,  // → 0
        0.0, 1.0,  // → 1
        1.0, 0.0,  // → 1
        1.0, 1.0,  // → 0
    ], &[4, 2]);

    let y = Tensor::new(&[0.0, 1.0, 1.0, 0.0], &[4, 1]);

    // Build network
    let mut model = Sequential::new()
        .add(Linear::with_seed(2, 8, Some(42)))
        .add(ReLU::new())
        .add(Linear::with_seed(8, 1, Some(43)))
        .add(Sigmoid::new());

    // Setup training
    let mut optimizer = SGD::new(model.parameters_mut(), 0.5);
    let loss_fn = MSELoss::new();

    // Training loop
    for epoch in 0..1000 {
        clear_graph();

        // Forward pass
        let x_grad = x.clone().requires_grad();
        let output = model.forward(&x_grad);

        // Compute loss
        let loss = loss_fn.forward(&output, &y);

        // Backward pass
        loss.backward();

        // Update weights
        let mut params = model.parameters_mut();
        optimizer.step_with_params(&mut params);
        optimizer.zero_grad();

        if epoch % 100 == 0 {
            println!("Epoch {}: Loss = {:.6}", epoch, loss.item());
        }
    }

    // Evaluate
    let final_output = model.forward(&x);
    println!("Predictions: {:?}", final_output.data());
}

Training Dynamics

Loss Curve

Epoch     Loss        Accuracy
─────────────────────────────
    0     0.304618      50%
  100     0.081109     100%
  200     0.013253     100%
  300     0.005368     100%
  500     0.002103     100%
 1000     0.000725     100%

The network:

  1. Starts random (50% accuracy = random guessing)
  2. Learns quickly (100% by epoch 100)
  3. Refines confidence (loss continues decreasing)

Final Predictions

Input    Target   Prediction   Confidence
──────────────────────────────────────────
(0,0)    0        0.034        96.6%
(0,1)    1        0.977        97.7%
(1,0)    1        0.974        97.4%
(1,1)    0        0.023        97.7%

Key Concepts Demonstrated

1. Automatic Differentiation

loss.backward();  // Computes ∂L/∂w for all weights

The autograd engine:

  • Records operations during forward pass
  • Computes gradients in reverse (backpropagation)
  • Applies the chain rule automatically (illustrated by hand below)
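
To make the chain rule concrete, here is the gradient of a single sigmoid neuron computed by hand (a plain-Rust sketch of what backward() automates; the values are illustrative only):

// Manual chain rule for one neuron: L = (sigmoid(w*x) - y)^2
fn main() {
    let (x, y, w) = (1.0_f64, 0.0_f64, 0.8_f64);

    // Forward pass: record intermediate values.
    let z = w * x;
    let p = 1.0 / (1.0 + (-z).exp()); // sigmoid(z)
    let loss = (p - y).powi(2);

    // Backward pass: multiply local derivatives in reverse order.
    let dloss_dp = 2.0 * (p - y);            // ∂L/∂p
    let dp_dz = p * (1.0 - p);               // ∂sigmoid/∂z
    let dz_dw = x;                           // ∂z/∂w
    let dloss_dw = dloss_dp * dp_dz * dz_dw; // ∂L/∂w

    println!("loss = {loss:.4}, dL/dw = {dloss_dw:.4}");
}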

2. Non-Linear Activation

.add(ReLU::new())  // f(x) = max(0, x)

ReLU enables the network to learn non-linear decision boundaries. Without it, stacking linear layers would still be linear.
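
A quick numeric check of that claim (plain Rust, unrelated to the library): two stacked 1-D linear maps collapse into a single linear map, while inserting ReLU between them does not:

fn relu(x: f64) -> f64 { x.max(0.0) }

fn main() {
    // Two 1-D "layers": y = a2 * (a1 * x + b1) + b2
    let (a1, b1, a2, b2) = (2.0, 1.0, -3.0, 0.5);
    let x = -1.0;

    let stacked = a2 * (a1 * x + b1) + b2;
    // The same map as one linear layer: slope a2*a1, intercept a2*b1 + b2.
    let collapsed = (a2 * a1) * x + (a2 * b1 + b2);
    assert!((stacked - collapsed).abs() < 1e-12);

    // With ReLU in between, the composition is no longer linear in x.
    let with_relu = a2 * relu(a1 * x + b1) + b2;
    println!("linear-only: {stacked}, with ReLU: {with_relu}"); // 3.5 vs 0.5
}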

3. Gradient Descent

optimizer.step_with_params(&mut params);

Updates weights: w = w - lr × ∂L/∂w

With learning rate 0.5, the network converges in ~100 epochs.
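
The update itself is a one-liner; here is a plain-Rust sketch for a single scalar weight (values chosen only for illustration):

fn main() {
    let lr = 0.5;      // learning rate
    let mut w = 0.2;   // one weight
    let grad = -0.8;   // ∂L/∂w from the backward pass

    // Gradient descent step: w = w - lr * ∂L/∂w
    w -= lr * grad;
    println!("updated weight: {w}"); // 0.2 - 0.5 * (-0.8) = 0.6
}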

Running the Example

cargo run --example xor_training
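
The printed losses should roughly follow the loss curve above (exact values may vary with the library version), for example:

Epoch 0: Loss = 0.304618
Epoch 100: Loss = 0.081109
Epoch 200: Loss = 0.013253
...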

Exercises

  1. Change hidden size: Try 4 or 16 neurons instead of 8
  2. Change learning rate: What happens with lr=0.1 or lr=1.0?
  3. Use Adam optimizer: Replace SGD with Adam
  4. Add another hidden layer: Does it help or hurt?
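
For exercise 4, a deeper model can be built with the same Sequential API used above; this sketch simply replaces the model construction in the example (the extra seed value is arbitrary):

let mut model = Sequential::new()
    .add(Linear::with_seed(2, 8, Some(42)))
    .add(ReLU::new())
    .add(Linear::with_seed(8, 8, Some(43)))
    .add(ReLU::new())
    .add(Linear::with_seed(8, 1, Some(44)))
    .add(Sigmoid::new());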

Common Issues

Problem               Cause                     Solution
──────────────────────────────────────────────────────────────
Loss stuck at ~0.25   Vanishing gradients       Increase learning rate
Loss oscillates       Learning rate too high    Decrease learning rate
50% accuracy          Not learning              Check gradient flow

Theory: Universal Approximation

The XOR example demonstrates the Universal Approximation Theorem in miniature: a neural network with one hidden layer can approximate any continuous function on a bounded domain to arbitrary accuracy, given enough neurons.

XOR requires learning a function like:

f(x1, x2) ≈ x1(1-x2) + x2(1-x1)

The hidden layer learns intermediate features that make the problem linearly separable for the output layer, as the hand-crafted example below shows.
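
As a concrete (hand-crafted, not learned) illustration, two ReLU hidden units already suffice to compute XOR exactly on the four corner points; the trained network's 8-unit hidden layer will find different weights:

fn relu(x: f64) -> f64 { x.max(0.0) }

// Hand-picked hidden features: h1 = ReLU(x1 + x2), h2 = ReLU(x1 + x2 - 1).
// The output layer then reads off XOR as h1 - 2*h2.
fn xor(x1: f64, x2: f64) -> f64 {
    let h1 = relu(x1 + x2);
    let h2 = relu(x1 + x2 - 1.0);
    h1 - 2.0 * h2
}

fn main() {
    for &(x1, x2) in &[(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)] {
        println!("XOR({x1}, {x2}) = {}", xor(x1, x2)); // 0, 1, 1, 0
    }
}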

Next Steps