Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a fundamental dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional representation while preserving as much variance as possible. This chapter covers the theory, implementation, and practical considerations for using PCA in aprender.

Why Dimensionality Reduction?

High-dimensional data presents several challenges:

  • Curse of dimensionality: Distance metrics become less meaningful in high dimensions
  • Visualization: Data beyond two or three dimensions cannot be plotted directly
  • Computational cost: Training time grows with dimensionality
  • Overfitting: More features increase risk of spurious correlations
  • Storage: High-dimensional data requires more memory

PCA addresses these challenges by finding a lower-dimensional subspace that captures most of the data's variance.

Mathematical Foundation

Core Idea

PCA finds orthogonal directions (principal components) along which data varies the most. These directions are the eigenvectors of the covariance matrix.

Steps (a code sketch follows the list):

  1. Center the data (subtract mean)
  2. Compute covariance matrix
  3. Find eigenvalues and eigenvectors
  4. Project data onto top-k eigenvectors
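
These steps can be traced end-to-end on a toy dataset. The sketch below uses nalgebra directly (the same crate aprender uses internally, as described under Algorithm Details); it is an illustration of the math, not aprender's actual code:

use nalgebra::{DMatrix, SymmetricEigen};

// Toy data: 4 samples × 2 features.
let x = DMatrix::from_row_slice(4, 2, &[
    1.0_f32, 2.0,
    3.0, 4.0,
    5.0, 4.0,
    7.0, 6.0,
]);
let (n, p) = (x.nrows(), x.ncols());

// 1. Center the data: subtract each feature's mean.
let mut centered = x.clone();
for j in 0..p {
    let mean = x.column(j).sum() / n as f32;
    for i in 0..n {
        centered[(i, j)] -= mean;
    }
}

// 2. Covariance matrix: (X^T X) / (n - 1).
let cov = (centered.transpose() * &centered) / (n as f32 - 1.0);

// 3. Eigendecomposition of the symmetric covariance matrix.
let eigen = SymmetricEigen::new(cov);

// 4. Project onto the top-k eigenvectors (k = 1 here). In practice the
//    eigenpairs must first be sorted by descending eigenvalue; that step
//    is omitted in this toy sketch.
let w_k = eigen.eigenvectors.columns(0, 1).into_owned();
let x_pca = &centered * &w_k;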

Covariance Matrix

For centered data matrix X (n samples × p features):

Σ = (X^T X) / (n - 1)

The covariance matrix Σ is:

  • Symmetric: Σ = Σ^T
  • Positive semi-definite: all eigenvalues ≥ 0
  • Size: p × p (independent of n)

Eigendecomposition

The eigenvectors of Σ form the principal components:

Σ v_i = λ_i v_i

where:

  • v_i = i-th principal component (eigenvector)
  • λ_i = variance explained by v_i (eigenvalue)

Key properties:

  • Eigenvectors are orthogonal: v_i ⊥ v_j for i ≠ j
  • Eigenvalues sum to the total variance: λ_1 + λ_2 + ... + λ_p = trace(Σ)
  • Components ordered by decreasing eigenvalue

Projection

To project data onto k principal components:

X_pca = (X - μ) W_k

where:
  μ = column means
  W_k = [v_1, v_2, ..., v_k]  (p × k matrix)

Reconstruction

To reconstruct original space from reduced dimensions:

X_reconstructed = X_pca W_k^T + μ

Perfect reconstruction when k = p (all components kept).
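
As a quick check of this property, the round trip below keeps all components. It is a sketch: `data` is assumed to be an existing Matrix<f32> and `p` its number of features (both placeholders, not defined here).

use aprender::preprocessing::PCA;
use aprender::traits::Transformer;

// Keeping every component (k = p) makes the projection invertible:
// inverse_transform(transform(x)) recovers x up to floating-point error.
let mut pca = PCA::new(p);
let reduced = pca.fit_transform(&data)?;
let restored = pca.inverse_transform(&reduced)?;
// `restored` matches `data` element-wise to within small numerical error.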

Implementation in Aprender

Basic Usage

use aprender::preprocessing::{PCA, StandardScaler};
use aprender::traits::Transformer;
use aprender::primitives::Matrix;

// Always standardize first (PCA is scale-sensitive)
let mut scaler = StandardScaler::new();
let scaled_data = scaler.fit_transform(&data)?;

// Reduce from 4D to 2D
let mut pca = PCA::new(2);
let reduced = pca.fit_transform(&scaled_data)?;

// Analyze explained variance
let var_ratio = pca.explained_variance_ratio().unwrap();
println!("PC1 explains {:.1}%", var_ratio[0] * 100.0);
println!("PC2 explains {:.1}%", var_ratio[1] * 100.0);

// Reconstruct original space
let reconstructed = pca.inverse_transform(&reduced)?;

Transformer Trait

PCA implements the Transformer trait:

pub trait Transformer {
    fn fit(&mut self, x: &Matrix<f32>) -> Result<(), &'static str>;
    fn transform(&self, x: &Matrix<f32>) -> Result<Matrix<f32>, &'static str>;
    fn fit_transform(&mut self, x: &Matrix<f32>) -> Result<Matrix<f32>, &'static str> {
        self.fit(x)?;
        self.transform(x)
    }
}

This enables:

  • Fit on training data → Learn components
  • Transform test data → Apply same projection
  • Pipeline compatibility → Chain with other transformers
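
For example, components learned on training data are applied unchanged to test data. This sketch assumes `train` and `test` are existing Matrix<f32> values and that StandardScaler implements the same Transformer trait:

use aprender::preprocessing::{PCA, StandardScaler};
use aprender::traits::Transformer;

// Fit the scaler and PCA on training data only.
let mut scaler = StandardScaler::new();
let train_scaled = scaler.fit_transform(&train)?;

let mut pca = PCA::new(2);
let train_reduced = pca.fit_transform(&train_scaled)?;

// Apply the *same* mean, scale, and components to the test data.
let test_scaled = scaler.transform(&test)?;
let test_reduced = pca.transform(&test_scaled)?;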

Explained Variance

let explained_var = pca.explained_variance().unwrap();
let explained_ratio = pca.explained_variance_ratio().unwrap();

// Cumulative variance
let mut cumsum = 0.0;
for (i, ratio) in explained_ratio.iter().enumerate() {
    cumsum += ratio;
    println!("PC{}: {:.2}% (cumulative: {:.2}%)",
             i+1, ratio*100.0, cumsum*100.0);
}

Rule of thumb: Keep components until 90-95% variance explained.

Principal Components (Loadings)

let components = pca.components().unwrap();
let (n_components, n_features) = components.shape();

for i in 0..n_components {
    println!("PC{} loadings:", i+1);
    for j in 0..n_features {
        println!("  Feature {}: {:.4}", j, components.get(i, j));
    }
}

Interpretation:

  • Larger absolute values = more important for that component (see the sketch below)
  • Sign indicates direction of influence
  • Orthogonal components capture different variation patterns
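
Following the first point above, here is a small sketch that reuses the accessors from the previous listing (assuming a fitted `pca`) to report the dominant feature of each component:

// For each component, find the feature with the largest absolute loading.
let components = pca.components().unwrap();
let (n_components, n_features) = components.shape();

for i in 0..n_components {
    let mut best_j = 0;
    let mut best_abs = 0.0_f32;
    for j in 0..n_features {
        let w = components.get(i, j).abs();
        if w > best_abs {
            best_abs = w;
            best_j = j;
        }
    }
    println!("PC{} is driven most by feature {} (|loading| = {:.4})",
             i + 1, best_j, best_abs);
}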

Time and Space Complexity

Computational Cost

| Operation          | Time Complexity | Space Complexity |
|--------------------|-----------------|------------------|
| Center data        | O(n · p)        | O(n · p)         |
| Covariance matrix  | O(p² · n)       | O(p²)            |
| Eigendecomposition | O(p³)           | O(p²)            |
| Transform          | O(n · k · p)    | O(n · k)         |
| Inverse transform  | O(n · k · p)    | O(n · p)         |

where:

  • n = number of samples
  • p = number of features
  • k = number of components

Bottleneck: Eigendecomposition is O(p³), making PCA impractical for p > 10,000 without specialized methods (truncated SVD, randomized PCA).

Memory Requirements

During fit:

  • Centered data: 4n·p bytes (f32)
  • Covariance matrix: 4p² bytes
  • Eigenvectors: 4k·p bytes (stored components)
  • Total: ~4(n·p + p²) bytes

Example (1000 samples, 100 features):

  • 0.4 MB centered data
  • 0.04 MB covariance
  • Total: ~0.44 MB

Scaling: Memory dominated by n·p term for large datasets.
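
The estimate above can be wrapped in a back-of-the-envelope helper. This is a sketch of the approximation only (the factor 4 assumes f32 storage); it is not part of aprender's API:

/// Rough peak memory (bytes) for fitting PCA on an n × p f32 matrix:
/// the centered copy (4·n·p) plus the covariance matrix (4·p·p).
fn pca_fit_memory_bytes(n: usize, p: usize) -> usize {
    4 * n * p + 4 * p * p
}

// 1000 samples × 100 features → 440_000 bytes ≈ 0.44 MB, matching the example above.
println!("{} bytes", pca_fit_memory_bytes(1000, 100));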

Choosing the Number of Components

Methods

  1. Variance threshold: Keep components explaining ≥ 90% variance
let ratios = pca.explained_variance_ratio().unwrap();
let mut cumsum = 0.0;
let mut k = 0;
for ratio in ratios {
    cumsum += ratio;
    k += 1;
    if cumsum >= 0.90 {
        break;
    }
}
println!("Need {} components for 90% variance", k);
  2. Scree plot: Look for the "elbow" where eigenvalues plateau

  3. Kaiser criterion: Keep components with eigenvalue > 1.0 (meaningful when PCA is fit on standardized data; see the sketch after this list)

  4. Domain knowledge: Use as many components as remain interpretable
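
As referenced in item 3, the Kaiser criterion can be written directly against the eigenvalues returned by explained_variance(). A sketch, assuming a fitted `pca` trained on standardized data:

// Kaiser criterion: count components whose eigenvalue exceeds 1.0.
// Meaningful when the data was standardized, where each original feature
// contributes exactly 1.0 unit of variance.
let eigenvalues = pca.explained_variance().unwrap();
let k = eigenvalues.iter().filter(|&&lambda| lambda > 1.0).count();
println!("Kaiser criterion keeps {} components", k);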

Tradeoffs

| Fewer Components       | More Components            |
|------------------------|----------------------------|
| Faster training        | Better reconstruction      |
| Less overfitting risk  | Preserves subtle patterns  |
| Simpler models         | Higher computational cost  |
| Information loss       | Potential overfitting      |

When to Use PCA

Good Use Cases

  • ✓ Visualization: Reduce to 2D/3D for plotting
  • ✓ Preprocessing: Remove correlated features before ML
  • ✓ Compression: Reduce storage for large datasets
  • ✓ Denoising: Remove low-variance (noisy) dimensions
  • ✓ Regularization: Prevent overfitting in high dimensions

When PCA Fails

  • ✗ Non-linear structure: PCA only captures linear relationships
  • ✗ Outliers: Covariance sensitive to extreme values
  • ✗ Sparse data: Text/categorical data better handled by other methods
  • ✗ Interpretability required: Principal components are linear combinations
  • ✗ Class separation not along high-variance directions: Use LDA instead

Algorithm Details

Eigendecomposition Implementation

Aprender uses nalgebra's SymmetricEigen for covariance matrix eigendecomposition:

use nalgebra::{DMatrix, SymmetricEigen};

let cov_matrix = DMatrix::from_row_slice(n_features, n_features, &cov);
let eigen = SymmetricEigen::new(cov_matrix);

let eigenvalues = eigen.eigenvalues;   // eigenvalues (ordering not guaranteed by the solver)
let eigenvectors = eigen.eigenvectors; // corresponding eigenvectors
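
nalgebra does not promise any particular ordering of these eigenpairs, so they are sorted before the top components are taken (see Numerical Stability below). A minimal sketch of that sorting step, continuing the listing above; it is illustrative, not aprender's exact code:

// Order component indices by descending eigenvalue.
let mut order: Vec<usize> = (0..eigenvalues.len()).collect();
order.sort_by(|&a, &b| eigenvalues[b].partial_cmp(&eigenvalues[a]).unwrap());

// order[0] now indexes the direction of greatest variance; the first k
// entries (and the matching eigenvector columns) form W_k.
let first_pc = eigenvectors.column(order[0]);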

Why SymmetricEigen?

  • Covariance matrices are symmetric positive semi-definite
  • Specialized algorithms (Jacobi, LAPACK SYEV) exploit symmetry
  • Guarantees real eigenvalues and orthogonal eigenvectors
  • More numerically stable than general eigendecomposition

Numerical Stability

Potential issues:

  1. Catastrophic cancellation: Subtracting nearly-equal numbers in covariance
  2. Eigenvalue precision: Small eigenvalues may be computed inaccurately
  3. Degeneracy: Multiple eigenvalues ≈ λ lead to non-unique eigenvectors

Aprender's approach:

  • Use f32 (single precision) for memory efficiency
  • Center data before covariance to reduce magnitude differences
  • Sort eigenvalues/vectors explicitly (not relying on solver ordering)
  • Components normalized to unit length (‖v_i‖ = 1)

Standardization Best Practice

Always standardize before PCA:

let mut scaler = StandardScaler::new();
let scaled = scaler.fit_transform(&data)?;
let mut pca = PCA::new(n_components);
let reduced = pca.fit_transform(&scaled)?;

Why?

  • Features with larger scales dominate variance
  • Example: Age (0-100) vs Income ($0-$1M) → Income dominates (illustrated below)
  • Standardization ensures each feature contributes equally
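
A tiny illustration of the Age vs Income point, in plain Rust with no library calls: the raw variances differ by orders of magnitude, so unscaled PCA would align PC1 almost entirely with the income axis.

// Raw per-feature variances for a toy sample (age in years, income in dollars).
fn variance(xs: &[f32]) -> f32 {
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    xs.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / (n - 1.0)
}

let ages = [25.0_f32, 40.0, 55.0, 70.0];
let incomes = [30_000.0_f32, 60_000.0, 90_000.0, 120_000.0];

// Age variance is on the order of 10², income variance on the order of 10⁹,
// so without standardization the income feature dominates the covariance.
println!("var(age)    = {:.1}", variance(&ages));
println!("var(income) = {:.1}", variance(&incomes));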

When not to standardize:

  • Features already on same scale (e.g., all pixel intensities 0-255)
  • Domain knowledge suggests unequal weighting is correct

Comparison with Other Methods

| Method            | Linear? | Supervised? | Preserves         | Use Case                     |
|-------------------|---------|-------------|-------------------|------------------------------|
| PCA               | Yes     | No          | Variance          | Unsupervised, visualization  |
| LDA               | Yes     | Yes         | Class separation  | Classification preprocessing |
| t-SNE             | No      | No          | Local structure   | Visualization only           |
| Autoencoders      | No      | No          | Reconstruction    | Non-linear compression       |
| Feature selection | N/A     | Optional    | Original features | Interpretability             |

PCA advantages:

  • Fast (closed-form solution)
  • Deterministic (no random initialization)
  • Interpretable components (linear combinations)
  • Mathematical guarantees (optimal variance preservation)

Example: Iris Dataset

Complete example from examples/pca_iris.rs:

use aprender::preprocessing::{PCA, StandardScaler};
use aprender::traits::Transformer;

// 1. Standardize
let mut scaler = StandardScaler::new();
let scaled = scaler.fit_transform(&iris_data)?;

// 2. Apply PCA (4D → 2D)
let mut pca = PCA::new(2);
let reduced = pca.fit_transform(&scaled)?;

// 3. Analyze results
let var_ratio = pca.explained_variance_ratio().unwrap();
println!("Variance captured: {:.1}%",
         var_ratio.iter().sum::<f32>() * 100.0);

// 4. Reconstruct
let reconstructed_scaled = pca.inverse_transform(&reduced)?;
let reconstructed = scaler.inverse_transform(&reconstructed_scaled)?;

// 5. Compute reconstruction error
let rmse = compute_rmse(&iris_data, &reconstructed);
println!("Reconstruction RMSE: {:.4}", rmse);

Typical results:

  • PC1 + PC2 capture ~96% of Iris variance
  • 2D projection enables visualization of 3 species
  • RMSE ≈ 0.18 (small reconstruction error)

Further Reading

  • Foundations: Jolliffe, I.T. "Principal Component Analysis" (2002)
  • SVD connection: PCA via SVD instead of covariance eigendecomposition
  • Kernel PCA: Non-linear extension using kernel trick
  • Incremental PCA: Online algorithm for streaming data
  • Randomized PCA: Approximate PCA for very high dimensions (p > 10,000)

API Reference

// Constructor
pub fn new(n_components: usize) -> Self

// Transformer trait
fn fit(&mut self, x: &Matrix<f32>) -> Result<(), &'static str>
fn transform(&self, x: &Matrix<f32>) -> Result<Matrix<f32>, &'static str>

// Accessors
pub fn explained_variance(&self) -> Option<&[f32]>
pub fn explained_variance_ratio(&self) -> Option<&[f32]>
pub fn components(&self) -> Option<&Matrix<f32>>

// Reconstruction
pub fn inverse_transform(&self, x: &Matrix<f32>) -> Result<Matrix<f32>, &'static str>

See also:

  • preprocessing::StandardScaler - Use before PCA in most cases (see Standardization Best Practice)
  • examples/pca_iris.rs - Complete walkthrough
  • traits::Transformer - Composable preprocessing pipeline