Case Study: PCA Iris
This case study demonstrates Principal Component Analysis (PCA) for dimensionality reduction on the famous Iris dataset, reducing 4D flower measurements to 2D while preserving 96% of the variance.
Overview
We'll apply PCA to Iris flower data to:
- Reduce 4 features (sepal/petal dimensions) to 2 principal components
- Analyze explained variance (how much information is preserved)
- Reconstruct original data and measure reconstruction error
- Understand principal component loadings (feature importance)
Running the Example
cargo run --example pca_iris
Expected output: Step-by-step PCA analysis including standardization, dimensionality reduction, explained variance analysis, transformed data samples, reconstruction quality, and principal component loadings.
Dataset
Iris Flower Measurements (30 samples)
// Features: [sepal_length, sepal_width, petal_length, petal_width]
// 10 samples each from: Setosa, Versicolor, Virginica
let data = Matrix::from_vec(30, 4, vec![
    // Setosa (small petals, large sepals)
    5.1, 3.5, 1.4, 0.2,
    4.9, 3.0, 1.4, 0.2,
    ...
    // Versicolor (medium petals and sepals)
    7.0, 3.2, 4.7, 1.4,
    6.4, 3.2, 4.5, 1.5,
    ...
    // Virginica (large petals and sepals)
    6.3, 3.3, 6.0, 2.5,
    5.8, 2.7, 5.1, 1.9,
    ...
])?;
Dataset characteristics:
- 30 samples (10 per species)
- 4 features (all measurements in centimeters)
- 3 species with distinct morphological patterns
Step 1: Standardizing Features
Why Standardize?
PCA is sensitive to feature scales. Without standardization:
- Features with larger values dominate variance
- Example: Sepal length (4-8 cm) would dominate petal width (0.1-2.5 cm)
- Result: Principal components biased toward large-scale features
Implementation
use aprender::preprocessing::{StandardScaler, PCA};
use aprender::traits::Transformer;
let mut scaler = StandardScaler::new();
let scaled_data = scaler.fit_transform(&data)?;
StandardScaler transforms each feature to zero mean and unit variance:
X_scaled = (X - mean) / std
After standardization, all features contribute equally to PCA.
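To make the standardization step concrete, here is a minimal standalone sketch in plain Rust, without aprender; the denominator choice (n rather than n-1) is an assumption and may differ from what StandardScaler actually uses:

// Standardize one feature column by hand: z = (x - mean) / std.
fn standardize(column: &[f32]) -> Vec<f32> {
    let n = column.len() as f32;
    let mean = column.iter().sum::<f32>() / n;
    // Population variance (divide by n); the library may use n or n-1.
    let var = column.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n;
    let std = var.sqrt();
    column.iter().map(|x| (x - mean) / std).collect()
}

fn main() {
    // Five Setosa sepal lengths (values from the canonical Iris data).
    let sepal_length: [f32; 5] = [5.1, 4.9, 4.7, 4.6, 5.0];
    let z = standardize(&sepal_length);
    let mean: f32 = z.iter().sum::<f32>() / z.len() as f32;
    println!("standardized: {:?}", z);
    println!("mean after scaling ≈ {:.6}", mean); // ≈ 0
}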
Step 2: Applying PCA (4D → 2D)
Dimensionality Reduction
let mut pca = PCA::new(2); // Keep 2 principal components
let transformed = pca.fit_transform(&scaled_data)?;
println!("Original shape: {:?}", data.shape()); // (30, 4)
println!("Reduced shape: {:?}", transformed.shape()); // (30, 2)
What happens during fit:
1. Compute the covariance matrix of the standardized data: Σ = (X^T X) / (n-1)
2. Eigendecomposition: Σ v_i = λ_i v_i
3. Sort eigenvectors by eigenvalue (descending)
4. Keep the top 2 eigenvectors as the principal components
Transform projects data onto principal components:
X_pca = (X - mean) @ components^T
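For intuition, here is a minimal sketch of the covariance and projection steps on a toy zero-mean matrix (plain Rust, row-major Vec<Vec<f32>>; the eigendecomposition itself is omitted, and the hand-picked unit component below is illustrative, not a real Iris eigenvector):

// Covariance of centered data: sigma[i][j] = sum_r x[r][i]*x[r][j] / (n-1).
fn covariance(x: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let n = x.len();
    let p = x[0].len();
    let mut cov = vec![vec![0.0; p]; p];
    for i in 0..p {
        for j in 0..p {
            let s: f32 = x.iter().map(|row| row[i] * row[j]).sum();
            cov[i][j] = s / (n as f32 - 1.0);
        }
    }
    cov
}

// Project each row onto k components; components is k x p
// (one eigenvector per row), matching the pca.components() layout.
fn project(x: &[Vec<f32>], components: &[Vec<f32>]) -> Vec<Vec<f32>> {
    x.iter()
        .map(|row| {
            components
                .iter()
                .map(|v| row.iter().zip(v).map(|(a, b)| a * b).sum())
                .collect()
        })
        .collect()
}

fn main() {
    // Two features, three samples, already zero-mean.
    let x = vec![vec![1.0, -1.0], vec![0.0, 0.0], vec![-1.0, 1.0]];
    println!("covariance: {:?}", covariance(&x));
    // One hand-picked unit direction (1/sqrt(2), -1/sqrt(2)).
    let comps = vec![vec![0.7071, -0.7071]];
    println!("projected: {:?}", project(&x, &comps));
}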
Step 3: Explained Variance Analysis
Results
Explained Variance by Component:
PC1: 2.9501 (71.29%) ███████████████████████████████████
PC2: 1.0224 (24.71%) ████████████
Total Variance Captured: 96.00%
Information Lost: 4.00%
Interpretation
PC1 (71.29% variance):
- Captures overall flower size
- Dominant direction of variation
- Likely separates Setosa (small) from Virginica (large)
PC2 (24.71% variance):
- Captures petal vs sepal differences
- Secondary variation pattern
- Likely separates Versicolor from other species
96% total variance: Excellent dimensionality reduction
- Only 4% information loss
- 2D representation sufficient for visualization
- Suitable for downstream ML tasks
Variance Ratios
let explained_var = pca.explained_variance()?;
let explained_ratio = pca.explained_variance_ratio()?;
for (i, (&var, &ratio)) in explained_var.iter()
    .zip(explained_ratio.iter())
    .enumerate()
{
    println!("PC{}: variance={:.4}, ratio={:.2}%", i + 1, var, ratio * 100.0);
}
Eigenvalues (explained_variance):
- PC1: 2.9501 (variance captured)
- PC2: 1.0224
- PC1 + PC2 ≈ 3.97, close to the total variance of ≈ 4 for four standardized features
Across all four components the ratios sum to 1.0: all variance is accounted for. A short sketch below shows how to pick the number of components from these ratios.
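A minimal sketch for choosing the smallest k that reaches a cumulative variance target (the first two ratios come from the run above; the third and fourth are illustrative placeholders, not output from the example):

// Smallest k whose cumulative explained-variance ratio reaches `target`.
fn choose_k(ratios: &[f32], target: f32) -> usize {
    let mut cumulative = 0.0;
    for (i, r) in ratios.iter().enumerate() {
        cumulative += r;
        if cumulative >= target {
            return i + 1;
        }
    }
    ratios.len()
}

fn main() {
    // PC1/PC2 from the run above; PC3/PC4 illustrative.
    let ratios = [0.7129, 0.2471, 0.03, 0.01];
    println!("k for 95% variance: {}", choose_k(&ratios, 0.95)); // -> 2
}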
Step 4: Transformed Data
Sample Output
Sample Species PC1 PC2
────────────────────────────────────────────
0 Setosa -2.2055 -0.8904
1 Setosa -2.0411 0.4635
10 Versicolor 0.9644 -0.8293
11 Versicolor 0.6384 -0.6166
20 Virginica 1.7447 -0.8603
21 Virginica 1.0657 0.8717
Visual Separation
PC1 axis (horizontal):
- Setosa: Negative values (~-2.2)
- Versicolor: Slightly positive (~0.8)
- Virginica: Positive values (~1.5)
PC2 axis (vertical):
- All species: Values range from -1 to +1
- Less separable than PC1
Conclusion: 2D projection enables easy visualization and classification of species.
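To plot this projection in an external tool, the scores can be dumped to CSV. A minimal sketch using only the standard library (the labels, output path, and write_scores_csv helper are illustrative; the three sample rows are taken from the table above):

use std::fs::File;
use std::io::{BufWriter, Write};

// Write (species, PC1, PC2) rows to a CSV file for external plotting.
fn write_scores_csv(
    path: &str,
    labels: &[&str],
    scores: &[[f32; 2]],
) -> std::io::Result<()> {
    let mut out = BufWriter::new(File::create(path)?);
    writeln!(out, "species,pc1,pc2")?;
    for (label, pc) in labels.iter().zip(scores) {
        writeln!(out, "{},{:.4},{:.4}", label, pc[0], pc[1])?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // Three rows from the sample output above.
    let labels = ["Setosa", "Setosa", "Versicolor"];
    let scores = [[-2.2055, -0.8904], [-2.0411, 0.4635], [0.9644, -0.8293]];
    write_scores_csv("pca_scores.csv", &labels, &scores)
}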
Step 5: Reconstruction (2D → 4D)
Implementation
let reconstructed_scaled = pca.inverse_transform(&transformed)?;
let reconstructed = scaler.inverse_transform(&reconstructed_scaled)?;
Inverse transform (components is the k×p matrix, so the transpose appears in the forward transform, not here):
X_reconstructed = X_pca @ components + mean
Reconstruction Error
Reconstruction Error Metrics:
MSE: 0.033770
RMSE: 0.183767
Max Error: 0.699232
Sample Reconstruction:
Feature Original Reconstructed
──────────────────────────────────
Sample 0:
Sepal L 5.1000 5.0208 (error: -0.08 cm)
Sepal W 3.5000 3.5107 (error: +0.01 cm)
Petal L 1.4000 1.4504 (error: +0.05 cm)
Petal W 0.2000 0.2462 (error: +0.05 cm)
Interpretation
RMSE = 0.184:
- Average reconstruction error is 0.184 cm
- Small compared to the feature ranges (roughly 0.1-8 cm)
- Demonstrates 2D representation preserves most information
Max error = 0.70 cm:
- Worst-case reconstruction error
- Still reasonable for biological measurements
- Validates 96% variance capture claim
Why not perfect reconstruction?
- 2 components < 4 original features
- 4% variance discarded
- Trade-off: compression vs accuracy
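The MSE/RMSE figures above, and the compute_rmse call in the full code at the end of this page, rely on a small error helper the example is assumed to define. A minimal sketch, assuming Matrix exposes shape() -> (rows, cols) and get(r, c) as used elsewhere in this case study:

use aprender::primitives::Matrix;

// Root-mean-square error between two matrices of identical shape.
// Assumes the shape()/get() accessors shown in this case study.
fn compute_rmse(a: &Matrix, b: &Matrix) -> f32 {
    let (rows, cols) = a.shape();
    let mut sum_sq = 0.0_f32;
    for r in 0..rows {
        for c in 0..cols {
            let diff = a.get(r, c) - b.get(r, c);
            sum_sq += diff * diff;
        }
    }
    (sum_sq / (rows * cols) as f32).sqrt()
}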
Step 6: Principal Component Loadings
Feature Importance
Component Sepal L Sepal W Petal L Petal W
──────────────────────────────────────────────────────
PC1 0.5310 -0.2026 0.5901 0.5734
PC2 -0.3407 -0.9400 0.0033 -0.0201
Interpretation
PC1 (overall size):
- Positive loadings: Sepal L (0.53), Petal L (0.59), Petal W (0.57)
- Negative loading: Sepal W (-0.20)
- Meaning: Larger flowers score high on PC1
- Separates Setosa (small) vs Virginica (large)
PC2 (petal vs sepal differences):
- Strong negative: Sepal W (-0.94)
- Near-zero: Petal L (0.003), Petal W (-0.02)
- Meaning: Captures sepal width variation
- Separates species by sepal shape
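The loadings table above can be printed straight from the fitted model; a minimal sketch reusing the components() accessor from the orthogonality check below (the feature names are supplied by hand):

// Print each principal component's loadings per feature.
let components = pca.components()?; // k x p: one eigenvector per row
let features = ["Sepal L", "Sepal W", "Petal L", "Petal W"];
for pc in 0..2 {
    print!("PC{}:", pc + 1);
    for (j, name) in features.iter().enumerate() {
        print!(" {}={:+.4}", name, components.get(pc, j));
    }
    println!();
}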
Mathematical Properties
Orthogonality: PC1 ⊥ PC2
let components = pca.components()?;
let dot_product = (0..4)
    .map(|k| components.get(0, k) * components.get(1, k))
    .sum::<f32>();
assert!(dot_product.abs() < 1e-6); // ≈ 0
Unit length: ‖v_i‖ = 1
let norm_sq = (0..4)
    .map(|k| {
        let val = components.get(0, k);
        val * val
    })
    .sum::<f32>();
assert!((norm_sq.sqrt() - 1.0).abs() < 1e-6); // ≈ 1
Performance Metrics
Time Complexity
| Operation | Iris Dataset | General (n×p) |
|---|---|---|
| Standardization | 0.12 ms | O(n·p) |
| Covariance | 0.05 ms | O(p²·n) |
| Eigendecomposition | 0.03 ms | O(p³) |
| Transform | 0.02 ms | O(n·k·p) |
| Total | 0.22 ms | O(p³ + p²·n) |
Bottleneck: Eigendecomposition O(p³)
- Iris: p=4, very fast (0.03 ms)
- High-dimensional: p>10,000, use truncated SVD
Memory Usage
Iris example:
- Centered data: 30×4×4 = 480 bytes
- Covariance matrix: 4×4×4 = 64 bytes
- Components stored: 2×4×4 = 32 bytes
- Total: ~576 bytes
General formula (f32, 4 bytes per element): ≈ 4(n·p + p² + k·p) bytes
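As a sanity check, the formula can be evaluated directly; a minimal sketch (assumes f32 storage and ignores allocator and struct overhead):

// Rough f32 memory estimate: centered data (n*p), covariance (p*p),
// and stored components (k*p), at 4 bytes per element.
fn pca_mem_bytes(n: usize, p: usize, k: usize) -> usize {
    4 * (n * p + p * p + k * p)
}

fn main() {
    println!("{} bytes", pca_mem_bytes(30, 4, 2)); // 576 for the Iris example
}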
Key Takeaways
When to Use PCA
✓ Visualization: Reduce to 2D/3D for plotting
✓ Preprocessing: Remove correlated features before ML
✓ Compression: Reduce storage by 50%+ with minimal information loss
✓ Denoising: Discard low-variance (noisy) components
PCA Assumptions
- Linear relationships: PCA captures linear structure only
- Variance = importance: High-variance directions are informative
- Standardization required: Features must be on similar scales
- Orthogonal components: Each PC independent of others
Best Practices
- Always standardize before PCA (unless features already scaled)
- Check explained variance: Aim for 90-95% cumulative
- Interpret loadings: Understand what each PC represents
- Validate reconstruction: Low RMSE confirms quality
- Visualize 2D projection: Verify species separation
Full Code
use aprender::preprocessing::{StandardScaler, PCA};
use aprender::primitives::Matrix;
use aprender::traits::Transformer;
// 1. Load data (iris_data: the 120 measurements listed in the Dataset section)
let data = Matrix::from_vec(30, 4, iris_data)?;
// 2. Standardize
let mut scaler = StandardScaler::new();
let scaled = scaler.fit_transform(&data)?;
// 3. Apply PCA
let mut pca = PCA::new(2);
let reduced = pca.fit_transform(&scaled)?;
// 4. Analyze variance
let var_ratio = pca.explained_variance_ratio()?;
println!("Variance: {:.1}%", var_ratio.iter().sum::<f32>() * 100.0);
// 5. Reconstruct
let reconstructed_scaled = pca.inverse_transform(&reduced)?;
let reconstructed = scaler.inverse_transform(&reconstructed_scaled)?;
// 6. Compute error (compute_rmse: see the helper sketch in Step 5)
let rmse = compute_rmse(&data, &reconstructed);
println!("RMSE: {:.4}", rmse);
Further Exploration
Try different n_components:
let mut pca1 = PCA::new(1); // ~71% variance
let mut pca3 = PCA::new(3); // ~99% variance
let mut pca4 = PCA::new(4); // 100% variance (perfect reconstruction)
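A minimal sketch that fits each choice in a loop and reports the cumulative variance, reusing the scaled matrix and the same API calls shown above (assuming explained_variance_ratio() returns one ratio per kept component):

// Compare how much variance each n_components setting retains.
for k in 1..=4 {
    let mut pca = PCA::new(k);
    let _reduced = pca.fit_transform(&scaled)?;
    let retained: f32 = pca.explained_variance_ratio()?.iter().sum();
    println!("k={}: {:.1}% variance retained", k, retained * 100.0);
}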
Analyze per-species variance:
- Compute PCA separately for each species
- Compare principal directions
- Identify species-specific variation patterns
Compare with other methods:
- LDA: Supervised dimensionality reduction (uses labels)
- t-SNE: Non-linear visualization (preserves local structure)
- UMAP: Non-linear, faster than t-SNE
Related Examples
- examples/iris_clustering.rs - K-Means on the same dataset
- book/src/ml-fundamentals/pca.md - Full PCA theory