Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a fundamental dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional representation while preserving as much variance as possible. This chapter covers the theory, implementation, and practical considerations for using PCA in aprender.

Why Dimensionality Reduction?

High-dimensional data presents several challenges:

  • Curse of dimensionality: Distance metrics become less meaningful in high dimensions
  • Visualization: Data beyond two or three dimensions cannot be plotted directly
  • Computational cost: Training time grows with dimensionality
  • Overfitting: More features increase risk of spurious correlations
  • Storage: High-dimensional data requires more memory

PCA addresses these challenges by finding a lower-dimensional subspace that captures most of the data's variance.

Mathematical Foundation

Core Idea

PCA finds orthogonal directions (principal components) along which data varies the most. These directions are the eigenvectors of the covariance matrix.

Steps (a code sketch follows the list):

  1. Center the data (subtract mean)
  2. Compute covariance matrix
  3. Find eigenvalues and eigenvectors
  4. Project data onto top-k eigenvectors
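
These steps can be traced end-to-end on a toy dataset. The sketch below uses nalgebra directly (the same crate aprender uses internally, as described under Algorithm Details); it is an illustration of the math, not aprender's actual code:

use nalgebra::{DMatrix, SymmetricEigen};

// Toy data: 4 samples × 2 features.
let x = DMatrix::from_row_slice(4, 2, &[
    1.0_f32, 2.0,
    3.0, 4.0,
    5.0, 4.0,
    7.0, 6.0,
]);
let (n, p) = (x.nrows(), x.ncols());

// 1. Center the data: subtract each feature's mean.
let mut centered = x.clone();
for j in 0..p {
    let mean = x.column(j).sum() / n as f32;
    for i in 0..n {
        centered[(i, j)] -= mean;
    }
}

// 2. Covariance matrix: (X^T X) / (n - 1).
let cov = (centered.transpose() * &centered) / (n as f32 - 1.0);

// 3. Eigendecomposition of the symmetric covariance matrix.
let eigen = SymmetricEigen::new(cov);

// 4. Project onto the top-k eigenvectors (k = 1 here). In practice the
//    eigenpairs must first be sorted by descending eigenvalue; that step
//    is omitted in this toy sketch.
let w_k = eigen.eigenvectors.columns(0, 1).into_owned();
let x_pca = &centered * &w_k;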

Covariance Matrix

For centered data matrix X (n samples × p features):

Σ = (X^T X) / (n - 1)

The covariance matrix Σ is:

  • Symmetric: Σ = Σ^T
  • Positive semi-definite: all eigenvalues ≥ 0
  • Size: p × p (independent of n)

Eigendecomposition

The eigenvectors of Σ form the principal components:

Σ v_i = λ_i v_i

where:

  • v_i = i-th principal component (eigenvector)
  • λ_i = variance explained by v_i (eigenvalue)

Key properties:

  • Eigenvectors are orthogonal: v_i ⊥ v_j for i ≠ j
  • Eigenvalues sum to the total variance: λ_1 + λ_2 + ... + λ_p = trace(Σ)
  • Components ordered by decreasing eigenvalue

Projection

To project data onto k principal components:

X_pca = (X - μ) W_k

where:
  μ = column means
  W_k = [v_1, v_2, ..., v_k]  (p × k matrix)

Reconstruction

To reconstruct original space from reduced dimensions:

X_reconstructed = X_pca W_k^T + μ

Perfect reconstruction when k = p (all components kept).
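
As a quick check of this property, the round trip below keeps all components. It is a sketch: `data` is assumed to be an existing Matrix<f32> and `p` its number of features (both placeholders, not defined here).

use aprender::preprocessing::PCA;
use aprender::traits::Transformer;

// Keeping every component (k = p) makes the projection invertible:
// inverse_transform(transform(x)) recovers x up to floating-point error.
let mut pca = PCA::new(p);
let reduced = pca.fit_transform(&data)?;
let restored = pca.inverse_transform(&reduced)?;
// `restored` matches `data` element-wise to within small numerical error.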

Implementation in Aprender

Basic Usage

use aprender::preprocessing::{PCA, StandardScaler};
use aprender::traits::Transformer;
use aprender::primitives::Matrix;

// Always standardize first (PCA is scale-sensitive)
let mut scaler = StandardScaler::new();
let scaled_data = scaler.fit_transform(&data)?;

// Reduce from 4D to 2D
let mut pca = PCA::new(2);
let reduced = pca.fit_transform(&scaled_data)?;

// Analyze explained variance
let var_ratio = pca.explained_variance_ratio().unwrap();
println!("PC1 explains {:.1}%", var_ratio[0] * 100.0);
println!("PC2 explains {:.1}%", var_ratio[1] * 100.0);

// Reconstruct original space
let reconstructed = pca.inverse_transform(&reduced)?;

Transformer Trait

PCA implements the Transformer trait:

pub trait Transformer {
    fn fit(&mut self, x: &Matrix<f32>) -> Result<(), &'static str>;
    fn transform(&self, x: &Matrix<f32>) -> Result<Matrix<f32>, &'static str>;
    fn fit_transform(&mut self, x: &Matrix<f32>) -> Result<Matrix<f32>, &'static str> {
        self.fit(x)?;
        self.transform(x)
    }
}

This enables:

  • Fit on training data → Learn components
  • Transform test data → Apply same projection
  • Pipeline compatibility → Chain with other transformers
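
For example, components learned on training data are applied unchanged to test data. This sketch assumes `train` and `test` are existing Matrix<f32> values and that StandardScaler implements the same Transformer trait:

use aprender::preprocessing::{PCA, StandardScaler};
use aprender::traits::Transformer;

// Fit the scaler and PCA on training data only.
let mut scaler = StandardScaler::new();
let train_scaled = scaler.fit_transform(&train)?;

let mut pca = PCA::new(2);
let train_reduced = pca.fit_transform(&train_scaled)?;

// Apply the *same* mean, scale, and components to the test data.
let test_scaled = scaler.transform(&test)?;
let test_reduced = pca.transform(&test_scaled)?;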

Explained Variance

let explained_var = pca.explained_variance().unwrap();
let explained_ratio = pca.explained_variance_ratio().unwrap();

// Cumulative variance
let mut cumsum = 0.0;
for (i, ratio) in explained_ratio.iter().enumerate() {
    cumsum += ratio;
    println!("PC{}: {:.2}% (cumulative: {:.2}%)",
             i+1, ratio*100.0, cumsum*100.0);
}

Rule of thumb: Keep components until 90-95% variance explained.

Principal Components (Loadings)

let components = pca.components().unwrap();
let (n_components, n_features) = components.shape();

for i in 0..n_components {
    println!("PC{} loadings:", i+1);
    for j in 0..n_features {
        println!("  Feature {}: {:.4}", j, components.get(i, j));
    }
}

Interpretation:

  • Larger absolute values = more important for that component (see the sketch below)
  • Sign indicates direction of influence
  • Orthogonal components capture different variation patterns
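
Following the first point above, here is a small sketch that reuses the accessors from the previous listing (assuming a fitted `pca`) to report the dominant feature of each component:

// For each component, find the feature with the largest absolute loading.
let components = pca.components().unwrap();
let (n_components, n_features) = components.shape();

for i in 0..n_components {
    let mut best_j = 0;
    let mut best_abs = 0.0_f32;
    for j in 0..n_features {
        let w = components.get(i, j).abs();
        if w > best_abs {
            best_abs = w;
            best_j = j;
        }
    }
    println!("PC{} is driven most by feature {} (|loading| = {:.4})",
             i + 1, best_j, best_abs);
}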

Time and Space Complexity

Computational Cost

| Operation          | Time Complexity | Space Complexity |
|--------------------|-----------------|------------------|
| Center data        | O(n · p)        | O(n · p)         |
| Covariance matrix  | O(p² · n)       | O(p²)            |
| Eigendecomposition | O(p³)           | O(p²)            |
| Transform          | O(n · k · p)    | O(n · k)         |
| Inverse transform  | O(n · k · p)    | O(n · p)         |

where:

  • n = number of samples
  • p = number of features
  • k = number of components

Bottleneck: Eigendecomposition is O(p³), making PCA impractical for p > 10,000 without specialized methods (truncated SVD, randomized PCA).

Memory Requirements

During fit:

  • Centered data: 4n·p bytes (f32)
  • Covariance matrix: 4p² bytes
  • Eigenvectors: 4k·p bytes (stored components)
  • Total: ~4(n·p + p²) bytes

Example (1000 samples, 100 features):

  • 0.4 MB centered data
  • 0.04 MB covariance
  • Total: ~0.44 MB

Scaling: Memory dominated by n·p term for large datasets.
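
The estimate above can be wrapped in a back-of-the-envelope helper. This is a sketch of the approximation only (the factor 4 assumes f32 storage); it is not part of aprender's API:

/// Rough peak memory (bytes) for fitting PCA on an n × p f32 matrix:
/// the centered copy (4·n·p) plus the covariance matrix (4·p·p).
fn pca_fit_memory_bytes(n: usize, p: usize) -> usize {
    4 * n * p + 4 * p * p
}

// 1000 samples × 100 features → 440_000 bytes ≈ 0.44 MB, matching the example above.
println!("{} bytes", pca_fit_memory_bytes(1000, 100));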

Choosing the Number of Components

Methods

  1. Variance threshold: Keep components explaining ≥ 90% variance
let ratios = pca.explained_variance_ratio().unwrap();
let mut cumsum = 0.0;
let mut k = 0;
for ratio in ratios {
    cumsum += ratio;
    k += 1;
    if cumsum >= 0.90 {
        break;
    }
}
println!("Need {} components for 90% variance", k);
  2. Scree plot: Look for the "elbow" where eigenvalues plateau

  3. Kaiser criterion: Keep components with eigenvalue > 1.0 (meaningful when PCA is fit on standardized data; see the sketch after this list)

  4. Domain knowledge: Use as many components as remain interpretable
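
As referenced in item 3, the Kaiser criterion can be written directly against the eigenvalues returned by explained_variance(). A sketch, assuming a fitted `pca` trained on standardized data:

// Kaiser criterion: count components whose eigenvalue exceeds 1.0.
// Meaningful when the data was standardized, where each original feature
// contributes exactly 1.0 unit of variance.
let eigenvalues = pca.explained_variance().unwrap();
let k = eigenvalues.iter().filter(|&&lambda| lambda > 1.0).count();
println!("Kaiser criterion keeps {} components", k);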

Tradeoffs

| Fewer Components       | More Components            |
|------------------------|----------------------------|
| Faster training        | Better reconstruction      |
| Less overfitting risk  | Preserves subtle patterns  |
| Simpler models         | Higher computational cost  |
| Information loss       | Potential overfitting      |

When to Use PCA

Good Use Cases

  • ✓ Visualization: Reduce to 2D/3D for plotting
  • ✓ Preprocessing: Remove correlated features before ML
  • ✓ Compression: Reduce storage for large datasets
  • ✓ Denoising: Remove low-variance (noisy) dimensions
  • ✓ Regularization: Prevent overfitting in high dimensions

When PCA Fails

  • ✗ Non-linear structure: PCA only captures linear relationships
  • ✗ Outliers: Covariance sensitive to extreme values
  • ✗ Sparse data: Text/categorical data better handled by other methods
  • ✗ Interpretability required: Principal components are linear combinations
  • ✗ Class separation not along high-variance directions: Use LDA instead

Algorithm Details

Eigendecomposition Implementation

Aprender uses nalgebra's SymmetricEigen for covariance matrix eigendecomposition:

use nalgebra::{DMatrix, SymmetricEigen};

let cov_matrix = DMatrix::from_row_slice(n_features, n_features, &cov);
let eigen = SymmetricEigen::new(cov_matrix);

let eigenvalues = eigen.eigenvalues;   // eigenvalues (ordering not guaranteed by the solver)
let eigenvectors = eigen.eigenvectors; // corresponding eigenvectors
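
nalgebra does not promise any particular ordering of these eigenpairs, so they are sorted before the top components are taken (see Numerical Stability below). A minimal sketch of that sorting step, continuing the listing above; it is illustrative, not aprender's exact code:

// Order component indices by descending eigenvalue.
let mut order: Vec<usize> = (0..eigenvalues.len()).collect();
order.sort_by(|&a, &b| eigenvalues[b].partial_cmp(&eigenvalues[a]).unwrap());

// order[0] now indexes the direction of greatest variance; the first k
// entries (and the matching eigenvector columns) form W_k.
let first_pc = eigenvectors.column(order[0]);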

Why SymmetricEigen?

  • Covariance matrices are symmetric positive semi-definite
  • Specialized algorithms (Jacobi, LAPACK SYEV) exploit symmetry
  • Guarantees real eigenvalues and orthogonal eigenvectors
  • More numerically stable than general eigendecomposition

Numerical Stability

Potential issues:

  1. Catastrophic cancellation: Subtracting nearly-equal numbers in covariance
  2. Eigenvalue precision: Small eigenvalues may be computed inaccurately
  3. Degeneracy: Multiple eigenvalues ≈ λ lead to non-unique eigenvectors

Aprender's approach:

  • Use f32 (single precision) for memory efficiency
  • Center data before covariance to reduce magnitude differences
  • Sort eigenvalues/vectors explicitly (not relying on solver ordering)
  • Components normalized to unit length (‖v_i‖ = 1)

Standardization Best Practice

Always standardize before PCA:

let mut scaler = StandardScaler::new();
let scaled = scaler.fit_transform(&data)?;
let mut pca = PCA::new(n_components);
let reduced = pca.fit_transform(&scaled)?;

Why?

  • Features with larger scales dominate variance
  • Example: Age (0-100) vs Income ($0-$1M) → Income dominates (illustrated below)
  • Standardization ensures each feature contributes equally
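
A tiny illustration of the Age vs Income point, in plain Rust with no library calls: the raw variances differ by orders of magnitude, so unscaled PCA would align PC1 almost entirely with the income axis.

// Raw per-feature variances for a toy sample (age in years, income in dollars).
fn variance(xs: &[f32]) -> f32 {
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    xs.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / (n - 1.0)
}

let ages = [25.0_f32, 40.0, 55.0, 70.0];
let incomes = [30_000.0_f32, 60_000.0, 90_000.0, 120_000.0];

// Age variance is on the order of 10², income variance on the order of 10⁹,
// so without standardization the income feature dominates the covariance.
println!("var(age)    = {:.1}", variance(&ages));
println!("var(income) = {:.1}", variance(&incomes));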

When not to standardize:

  • Features already on same scale (e.g., all pixel intensities 0-255)
  • Domain knowledge suggests unequal weighting is correct

Comparison with Other Methods

| Method            | Linear? | Supervised? | Preserves         | Use Case                     |
|-------------------|---------|-------------|-------------------|------------------------------|
| PCA               | Yes     | No          | Variance          | Unsupervised, visualization  |
| LDA               | Yes     | Yes         | Class separation  | Classification preprocessing |
| t-SNE             | No      | No          | Local structure   | Visualization only           |
| Autoencoders      | No      | No          | Reconstruction    | Non-linear compression       |
| Feature selection | N/A     | Optional    | Original features | Interpretability             |

PCA advantages:

  • Fast (closed-form solution)
  • Deterministic (no random initialization)
  • Interpretable components (linear combinations)
  • Mathematical guarantees (optimal variance preservation)

Example: Iris Dataset

Complete example from examples/pca_iris.rs:

use aprender::preprocessing::{PCA, StandardScaler};
use aprender::traits::Transformer;

// 1. Standardize
let mut scaler = StandardScaler::new();
let scaled = scaler.fit_transform(&iris_data)?;

// 2. Apply PCA (4D → 2D)
let mut pca = PCA::new(2);
let reduced = pca.fit_transform(&scaled)?;

// 3. Analyze results
let var_ratio = pca.explained_variance_ratio().unwrap();
println!("Variance captured: {:.1}%",
         var_ratio.iter().sum::<f32>() * 100.0);

// 4. Reconstruct
let reconstructed_scaled = pca.inverse_transform(&reduced)?;
let reconstructed = scaler.inverse_transform(&reconstructed_scaled)?;

// 5. Compute reconstruction error
let rmse = compute_rmse(&iris_data, &reconstructed);
println!("Reconstruction RMSE: {:.4}", rmse);

Typical results:

  • PC1 + PC2 capture ~96% of Iris variance
  • 2D projection enables visualization of 3 species
  • RMSE ≈ 0.18 (small reconstruction error)

Further Reading

  • Foundations: Jolliffe, I.T. "Principal Component Analysis" (2002)
  • SVD connection: PCA via SVD instead of covariance eigendecomposition
  • Kernel PCA: Non-linear extension using kernel trick
  • Incremental PCA: Online algorithm for streaming data
  • Randomized PCA: Approximate PCA for very high dimensions (p > 10,000)

API Reference

// Constructor
pub fn new(n_components: usize) -> Self

// Transformer trait
fn fit(&mut self, x: &Matrix<f32>) -> Result<(), &'static str>
fn transform(&self, x: &Matrix<f32>) -> Result<Matrix<f32>, &'static str>

// Accessors
pub fn explained_variance(&self) -> Option<&[f32]>
pub fn explained_variance_ratio(&self) -> Option<&[f32]>
pub fn components(&self) -> Option<&Matrix<f32>>

// Reconstruction
pub fn inverse_transform(&self, x: &Matrix<f32>) -> Result<Matrix<f32>, &'static str>

See also:

  • preprocessing::StandardScaler - Use before PCA in most cases (see Standardization Best Practice)
  • examples/pca_iris.rs - Complete walkthrough
  • traits::Transformer - Composable preprocessing pipeline