Case Study: PCA on the Iris Dataset

This case study demonstrates Principal Component Analysis (PCA) for dimensionality reduction on the famous Iris dataset, reducing 4D flower measurements to 2D while preserving 96% of variance.

Overview

We'll apply PCA to Iris flower data to:

  • Reduce 4 features (sepal/petal dimensions) to 2 principal components
  • Analyze explained variance (how much information is preserved)
  • Reconstruct original data and measure reconstruction error
  • Understand principal component loadings (feature importance)

Running the Example

cargo run --example pca_iris

Expected output: Step-by-step PCA analysis including standardization, dimensionality reduction, explained variance analysis, transformed data samples, reconstruction quality, and principal component loadings.

Dataset

Iris Flower Measurements (30 samples)

// Features: [sepal_length, sepal_width, petal_length, petal_width]
// 10 samples each from: Setosa, Versicolor, Virginica

let data = Matrix::from_vec(30, 4, vec![
    // Setosa (small petals, large sepals)
    5.1, 3.5, 1.4, 0.2,
    4.9, 3.0, 1.4, 0.2,
    // ...
    // Versicolor (medium petals and sepals)
    7.0, 3.2, 4.7, 1.4,
    6.4, 3.2, 4.5, 1.5,
    // ...
    // Virginica (large petals and sepals)
    6.3, 3.3, 6.0, 2.5,
    5.8, 2.7, 5.1, 1.9,
    // ...
])?;

Dataset characteristics:

  • 30 samples (10 per species)
  • 4 features (all measurements in centimeters)
  • 3 species with distinct morphological patterns

Step 1: Standardizing Features

Why Standardize?

PCA is sensitive to feature scales. Without standardization:

  • Features with larger values dominate variance
  • Example: Sepal length (4-8 cm) would dominate petal width (0.1-2.5 cm)
  • Result: Principal components biased toward large-scale features

Implementation

use aprender::preprocessing::{StandardScaler, PCA};
use aprender::traits::Transformer;

let mut scaler = StandardScaler::new();
let scaled_data = scaler.fit_transform(&data)?;

StandardScaler transforms each feature to zero mean and unit variance:

X_scaled = (X - mean) / std

After standardization, all features contribute equally to PCA.
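
As a quick sanity check, each standardized column should now have mean ≈ 0 and variance ≈ 1. A minimal sketch, assuming Matrix::get(row, col) returns f32 and shape() returns (rows, cols) as used elsewhere in this example (the exact variance depends on whether the scaler uses the population or sample convention):

let (n, p) = scaled_data.shape();
for j in 0..p {
    // Column mean should be ~0 after standardization.
    let mean = (0..n).map(|i| scaled_data.get(i, j)).sum::<f32>() / n as f32;
    // Population variance should be ~1 (sample convention gives ~n/(n-1)).
    let var = (0..n)
        .map(|i| (scaled_data.get(i, j) - mean).powi(2))
        .sum::<f32>()
        / n as f32;
    assert!(mean.abs() < 1e-5);
    assert!((var - 1.0).abs() < 0.05);
}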

Step 2: Applying PCA (4D → 2D)

Dimensionality Reduction

let mut pca = PCA::new(2); // Keep 2 principal components
let transformed = pca.fit_transform(&scaled_data)?;

println!("Original shape: {:?}", data.shape());       // (30, 4)
println!("Reduced shape: {:?}", transformed.shape()); // (30, 2)

What happens during fit:

  1. Center the data (the per-feature mean is stored for inverse_transform)
  2. Compute covariance matrix: Σ = (X^T X) / (n-1)
  3. Eigendecomposition: Σ v_i = λ_i v_i
  4. Sort eigenvectors by eigenvalue (descending)
  5. Keep top 2 eigenvectors as principal components
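
For intuition, the covariance matrix from step 2 can be computed directly in plain Rust. This is a sketch of the math, not aprender's internal code path:

let (n, p) = scaled_data.shape();
let mut cov = vec![vec![0.0f32; p]; p];
for a in 0..p {
    for b in 0..p {
        // Σ[a][b] = (column_a · column_b) / (n - 1);
        // StandardScaler already centered the columns.
        let s: f32 = (0..n)
            .map(|i| scaled_data.get(i, a) * scaled_data.get(i, b))
            .sum();
        cov[a][b] = s / (n - 1) as f32;
    }
}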

Transform projects data onto principal components:

X_pca = (X - mean) @ components^T
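
To see the projection concretely, the first PC score of sample 0 can be reproduced by hand. A sketch: since StandardScaler already centered the data, the mean subtracted by PCA is ≈ 0 here, so the score is just a dot product with the first eigenvector:

let components = pca.components()?;
let mut pc1 = 0.0f32;
for k in 0..4 {
    // Dot product of the (already centered) sample with PC1's loadings.
    pc1 += scaled_data.get(0, k) * components.get(0, k);
}
assert!((pc1 - transformed.get(0, 0)).abs() < 1e-4);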

Step 3: Explained Variance Analysis

Results

Explained Variance by Component:
   PC1: 2.9501 (71.29%) ███████████████████████████████████
   PC2: 1.0224 (24.71%) ████████████

Total Variance Captured: 96.00%
Information Lost:        4.00%

Interpretation

PC1 (71.29% variance):

  • Captures overall flower size
  • Dominant direction of variation
  • Likely separates Setosa (small) from Virginica (large)

PC2 (24.71% variance):

  • Captures petal vs sepal differences
  • Secondary variation pattern
  • Likely separates Versicolor from other species

96% total variance: Excellent dimensionality reduction

  • Only 4% information loss
  • 2D representation sufficient for visualization
  • Suitable for downstream ML tasks

Variance Ratios

let explained_var = pca.explained_variance()?;
let explained_ratio = pca.explained_variance_ratio()?;

for (i, (&var, &ratio)) in explained_var.iter()
                             .zip(explained_ratio.iter()).enumerate() {
    println!("PC{}: variance={:.4}, ratio={:.2}%",
             i+1, var, ratio*100.0);
}

Eigenvalues (explained_variance):

  • PC1: 2.9501 (variance captured along PC1)
  • PC2: 1.0224
  • PC1 + PC2 ≈ 3.97, close to the total variance of the standardized data (≈ 4, one unit per feature)

Across all 4 components the ratios sum to 1.0; the two retained here account for 0.96.

Step 4: Transformed Data

Sample Output

Sample      Species        PC1        PC2
────────────────────────────────────────────
     0       Setosa    -2.2055    -0.8904
     1       Setosa    -2.0411     0.4635
    10   Versicolor     0.9644    -0.8293
    11   Versicolor     0.6384    -0.6166
    20    Virginica     1.7447    -0.8603
    21    Virginica     1.0657     0.8717

Visual Separation

PC1 axis (horizontal):

  • Setosa: Negative values (~-2.2)
  • Versicolor: Slightly positive (~0.8)
  • Virginica: Positive values (~1.5)

PC2 axis (vertical):

  • All species: Values range from -1 to +1
  • Less separable than PC1

Conclusion: 2D projection enables easy visualization and classification of species.

Step 5: Reconstruction (2D → 4D)

Implementation

let reconstructed_scaled = pca.inverse_transform(&transformed)?;
let reconstructed = scaler.inverse_transform(&reconstructed_scaled)?;

Inverse transform:

X_reconstructed = X_pca @ components + mean
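
The round trip can likewise be checked by hand for sample 0, a sketch under the same assumptions as before (components is the 2×4 matrix from pca.components(), and the PCA mean is ≈ 0 because the data was pre-centered):

let components = pca.components()?;
let mut rec = [0.0f32; 4];
for j in 0..4 {
    for c in 0..2 {
        // X_rec[j] = Σ_c X_pca[c] · components[c][j] (+ mean ≈ 0 here)
        rec[j] += transformed.get(0, c) * components.get(c, j);
    }
    assert!((rec[j] - reconstructed_scaled.get(0, j)).abs() < 1e-4);
}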

Reconstruction Error

Reconstruction Error Metrics:
   MSE:        0.033770
   RMSE:       0.183767
   Max Error:  0.699232

Sample Reconstruction:

Feature   Original  Reconstructed
──────────────────────────────────
Sample 0:
 Sepal L     5.1000         5.0208  (error: -0.08 cm)
 Sepal W     3.5000         3.5107  (error: +0.01 cm)
 Petal L     1.4000         1.4504  (error: +0.05 cm)
 Petal W     0.2000         0.2462  (error: +0.05 cm)

Interpretation

RMSE = 0.184:

  • Average reconstruction error is 0.184 cm
  • Small compared to the feature ranges (roughly 0.1-8 cm)
  • Demonstrates 2D representation preserves most information

Max error = 0.70 cm:

  • Worst-case reconstruction error
  • Still reasonable for biological measurements
  • Validates 96% variance capture claim

Why not perfect reconstruction?

  • 2 components < 4 original features
  • 4% variance discarded
  • Trade-off: compression vs accuracy
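
The metrics above come from comparing data and reconstructed element-wise. compute_rmse (used again in the full code below) is not part of aprender; one minimal way to write it, assuming an f32 Matrix with the get/shape API shown earlier:

fn compute_rmse(a: &Matrix, b: &Matrix) -> f32 {
    let (n, p) = a.shape();
    let mut sum_sq = 0.0f32;
    for i in 0..n {
        for j in 0..p {
            let diff = a.get(i, j) - b.get(i, j);
            sum_sq += diff * diff;
        }
    }
    // RMSE = sqrt of the mean squared error over all n·p entries.
    (sum_sq / (n * p) as f32).sqrt()
}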

Step 6: Principal Component Loadings

Feature Importance

 Component    Sepal L    Sepal W    Petal L    Petal W
──────────────────────────────────────────────────────
       PC1     0.5310    -0.2026     0.5901     0.5734
       PC2    -0.3407    -0.9400     0.0033    -0.0201
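
The table can be printed straight from the fitted model: pca.components() returns the loadings with one row per PC and one column per feature. A sketch:

let components = pca.components()?;
let names = ["Sepal L", "Sepal W", "Petal L", "Petal W"];
for c in 0..2 {
    print!("PC{}:", c + 1);
    for (j, name) in names.iter().enumerate() {
        print!("  {} {:+.4}", name, components.get(c, j));
    }
    println!();
}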

Interpretation

PC1 (overall size):

  • Positive loadings: Sepal L (0.53), Petal L (0.59), Petal W (0.57)
  • Negative loading: Sepal W (-0.20)
  • Meaning: Larger flowers score high on PC1
  • Separates Setosa (small) vs Virginica (large)

PC2 (petal vs sepal differences):

  • Strong negative: Sepal W (-0.94)
  • Near-zero: Petal L (0.003), Petal W (-0.02)
  • Meaning: Captures sepal width variation
  • Separates species by sepal shape

Mathematical Properties

Orthogonality: PC1 ⊥ PC2

let components = pca.components()?;
let dot_product = (0..4).map(|k| {
    components.get(0, k) * components.get(1, k)
}).sum::<f32>();
assert!(dot_product.abs() < 1e-6); // ≈ 0

Unit length: ‖v_i‖ = 1

let norm_sq = (0..4).map(|k| {
    let val = components.get(0, k);
    val * val
}).sum::<f32>();
assert!((norm_sq.sqrt() - 1.0).abs() < 1e-6); // ≈ 1

Performance Metrics

Time Complexity

Operation            Iris Dataset   General (n×p)
─────────────────────────────────────────────────
Standardization      0.12 ms        O(n·p)
Covariance           0.05 ms        O(p²·n)
Eigendecomposition   0.03 ms        O(p³)
Transform            0.02 ms        O(n·k·p)
Total                0.22 ms        O(p³ + p²·n)

Bottleneck: Eigendecomposition O(p³)

  • Iris: p=4, very fast (0.03 ms)
  • High-dimensional: p>10,000, use truncated SVD

Memory Usage

Iris example:

  • Centered data: 30×4×4 = 480 bytes
  • Covariance matrix: 4×4×4 = 64 bytes
  • Components stored: 2×4×4 = 32 bytes
  • Total: ~576 bytes

General formula: ≈ 4·(n·p + p² + k·p) bytes (f32, with k retained components)
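
A quick check of the arithmetic for f32 storage (4 bytes per element), with n, p, k bound to the Iris sizes:

let (n, p, k) = (30usize, 4usize, 2usize);
// centered data + covariance matrix + stored components
let bytes = 4 * (n * p + p * p + k * p);
assert_eq!(bytes, 576);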

Key Takeaways

When to Use PCA

  • Visualization: Reduce to 2D/3D for plotting
  • Preprocessing: Remove correlated features before ML
  • Compression: Reduce storage by 50%+ with minimal information loss
  • Denoising: Discard low-variance (noisy) components

PCA Assumptions

  1. Linear relationships: PCA captures linear structure only
  2. Variance = importance: High-variance directions are informative
  3. Standardization required: Features must be on similar scales
  4. Orthogonal components: PCs are mutually orthogonal (uncorrelated with each other)

Best Practices

  1. Always standardize before PCA (unless features already scaled)
  2. Check explained variance: Aim for 90-95% cumulative (see the sketch after this list)
  3. Interpret loadings: Understand what each PC represents
  4. Validate reconstruction: Low RMSE confirms quality
  5. Visualize 2D projection: Verify species separation
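
A sketch of practice 2: fit a full 4-component PCA so every ratio is available, then keep the smallest number of components whose cumulative ratio reaches 95%:

let mut pca_full = PCA::new(4);
pca_full.fit_transform(&scaled)?;
let ratios = pca_full.explained_variance_ratio()?;
let mut cumulative = 0.0f32;
let mut n_keep = 0;
for (i, &r) in ratios.iter().enumerate() {
    cumulative += r;
    if cumulative >= 0.95 {
        n_keep = i + 1;
        break;
    }
}
println!("Keep {} components ({:.1}% cumulative)", n_keep, cumulative * 100.0);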

Full Code

use aprender::preprocessing::{StandardScaler, PCA};
use aprender::primitives::Matrix;
use aprender::traits::Transformer;

// 1. Load data
let data = Matrix::from_vec(30, 4, iris_data)?;

// 2. Standardize
let mut scaler = StandardScaler::new();
let scaled = scaler.fit_transform(&data)?;

// 3. Apply PCA
let mut pca = PCA::new(2);
let reduced = pca.fit_transform(&scaled)?;

// 4. Analyze variance
let var_ratio = pca.explained_variance_ratio()?;
println!("Variance: {:.1}%", var_ratio.iter().sum::<f32>() * 100.0);

// 5. Reconstruct
let reconstructed_scaled = pca.inverse_transform(&reduced)?;
let reconstructed = scaler.inverse_transform(&reconstructed_scaled)?;

// 6. Compute error
let rmse = compute_rmse(&data, &reconstructed);
println!("RMSE: {:.4}", rmse);

Further Exploration

Try different n_components:

let mut pca1 = PCA::new(1);  // ~71% variance
let mut pca3 = PCA::new(3);  // ~99% variance
let mut pca4 = PCA::new(4);  // 100% variance (perfect reconstruction)

Analyze per-species variance:

  • Compute PCA separately for each species (sketch below)
  • Compare principal directions
  • Identify species-specific variation patterns
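
A sketch of the first bullet, assuming the samples are grouped in blocks of 10 rows as in the dataset above:

for (name, start) in [("Setosa", 0usize), ("Versicolor", 10), ("Virginica", 20)] {
    // Copy this species' 10 rows into a fresh matrix.
    let mut rows = Vec::with_capacity(10 * 4);
    for i in start..start + 10 {
        for j in 0..4 {
            rows.push(data.get(i, j));
        }
    }
    let species = Matrix::from_vec(10, 4, rows)?;
    let mut scaler = StandardScaler::new();
    let scaled = scaler.fit_transform(&species)?;
    let mut pca = PCA::new(2);
    pca.fit_transform(&scaled)?;
    let captured: f32 = pca.explained_variance_ratio()?.iter().sum();
    println!("{}: top-2 components capture {:.1}%", name, captured * 100.0);
}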

Compare with other methods:

  • LDA: Supervised dimensionality reduction (uses labels)
  • t-SNE: Non-linear visualization (preserves local structure)
  • UMAP: Non-linear, faster than t-SNE