# Data Preprocessing with Scalers
This example demonstrates feature scaling with StandardScaler and MinMaxScaler, two fundamental data preprocessing techniques used before training machine learning models.
## Overview
Feature scaling ensures that all features are on comparable scales, which is crucial for many ML algorithms: especially distance-based methods like K-NN and SVM, and gradient-based methods like neural networks.
## Running the Example

```bash
cargo run --example data_preprocessing_scalers
```
## Key Concepts
### StandardScaler (Z-score Normalization)
StandardScaler transforms features to have:
- Mean = 0 (centers data)
- Standard Deviation = 1 (scales data)
Formula: z = (x - μ) / σ
When to use:
- Data is approximately normally distributed
- Outliers are present (less distorted by them than MinMaxScaler)
- Algorithms sensitive to feature scale (SVM, neural networks)
- Want to preserve relative distances
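A minimal, dependency-free sketch of the math behind the transform (the actual scaler applies this per column of a `Matrix<f32>`; population standard deviation is assumed here, which matches the statistics shown in Example 1 below):

```rust
// Sketch of the z-score computation for one feature column.
fn standardize(values: &[f32]) -> Vec<f32> {
    let n = values.len() as f32;
    let mean = values.iter().sum::<f32>() / n;
    let var = values.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n;
    let std = var.sqrt();
    values.iter().map(|x| (x - mean) / std).collect() // z = (x - μ) / σ
}

fn main() {
    let z = standardize(&[100.0, 200.0, 300.0, 400.0, 500.0]);
    println!("{z:?}"); // ≈ [-1.41, -0.71, 0.00, 0.71, 1.41]
}
```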
### MinMaxScaler (Range Normalization)
MinMaxScaler transforms features to a specific range (default [0, 1]):
Formula: x' = (x - min) / (max - min)
When to use:
- Need a specific output range (e.g., [0, 1] for probabilities)
- Data not normally distributed
- No outliers present
- Want to preserve zero values
- Image processing (pixel normalization)
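The corresponding sketch for the default [0, 1] mapping (again per feature column; the real scaler operates on `Matrix<f32>`):

```rust
// Sketch of the [0, 1] min-max mapping for one feature column.
fn min_max_scale(values: &[f32]) -> Vec<f32> {
    let min = values.iter().copied().fold(f32::INFINITY, f32::min);
    let max = values.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // x' = (x - min) / (max - min)
    values.iter().map(|x| (x - min) / (max - min)).collect()
}

fn main() {
    println!("{:?}", min_max_scale(&[10.0, 20.0, 30.0, 40.0, 50.0]));
    // [0.0, 0.25, 0.5, 0.75, 1.0]
}
```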
## Examples Demonstrated
### Example 1: StandardScaler Basics
Shows how StandardScaler transforms data with different scales:
```text
Original Data:
  Feature 0: [100, 200, 300, 400, 500]
  Feature 1: [1, 2, 3, 4, 5]

Computed Statistics:
  Mean: [300.0, 3.0]
  Std:  [141.42, 1.41]

After StandardScaler:
  Sample 0: [-1.41, -1.41]
  Sample 1: [-0.71, -0.71]
  Sample 2: [ 0.00,  0.00]
  Sample 3: [ 0.71,  0.71]
  Sample 4: [ 1.41,  1.41]
```
Both features now have mean=0 and std=1, despite very different original scales.
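In code, this example reduces to something like the sketch below. The module paths and the `Matrix::from_vec` constructor are assumptions for illustration, not necessarily aprender's exact API:

```rust
use aprender::preprocessing::StandardScaler; // path assumed
use aprender::primitives::Matrix;            // path assumed

fn main() {
    // 5 samples x 2 features, row-major (constructor name assumed).
    let x = Matrix::from_vec(5, 2, vec![
        100.0, 1.0,
        200.0, 2.0,
        300.0, 3.0,
        400.0, 4.0,
        500.0, 5.0,
    ]);

    let mut scaler = StandardScaler::new();
    let x_scaled = scaler.fit_transform(&x).unwrap();
    // Every column of x_scaled now has mean 0 and std 1.
}
```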
### Example 2: MinMaxScaler Basics
Shows how MinMaxScaler transforms to [0, 1] range:
```text
Original Data:
  Feature 0: [10, 20, 30, 40, 50]
  Feature 1: [100, 200, 300, 400, 500]

After MinMaxScaler [0, 1]:
  Sample 0: [0.00, 0.00]
  Sample 1: [0.25, 0.25]
  Sample 2: [0.50, 0.50]
  Sample 3: [0.75, 0.75]
  Sample 4: [1.00, 1.00]
```
Both features are now in the [0, 1] range with identical relative positions; for example, feature 0's value 20 maps to (20 − 10) / (50 − 10) = 0.25.
### Example 3: Handling Outliers
Demonstrates how each scaler responds to outliers:
```text
Data with Outlier: [1, 2, 3, 4, 5, 100]

Original    StandardScaler    MinMaxScaler
--------------------------------------------
   1.0          -0.50             0.00
   2.0          -0.47             0.01
   3.0          -0.45             0.02
   4.0          -0.42             0.03
   5.0          -0.39             0.04
 100.0           2.23             1.00
```
Observations:
- StandardScaler: the outlier lands only ≈2.2 standard deviations from the mean, and the inliers remain spread out (less compression)
- MinMaxScaler: the outlier becomes the maximum, compressing every other value toward 0 (heavily affected)
Recommendation: Use StandardScaler when outliers are present.
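Reusing the `standardize` and `min_max_scale` sketches from the Key Concepts section above, the contrast is easy to reproduce:

```rust
fn main() {
    let data = [1.0_f32, 2.0, 3.0, 4.0, 5.0, 100.0];
    // The outlier inflates the std, but inliers stay distinguishable:
    println!("{:?}", standardize(&data));   // ≈ [-0.50, -0.47, -0.45, -0.42, -0.39, 2.23]
    // The outlier defines the max, crushing inliers toward 0:
    println!("{:?}", min_max_scale(&data)); // ≈ [0.00, 0.01, 0.02, 0.03, 0.04, 1.00]
}
```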
### Example 4: Impact on K-NN Classification
Shows why scaling is critical for distance-based algorithms:
```text
Dataset: Employee classification
  Feature 0: Salary in $1000s (50-95, range = 45)
  Feature 1: Age in years (25-42, range = 17)

Test point: Salary = 70k, Age = 33

Without scaling: distance dominated by salary
With scaling:    both features contribute equally
```
Why it matters:
- K-NN uses Euclidean distance
- Large-scale features (salary) dominate the calculation
- Small differences in age (2-3 years) become negligible
- Scaling equalizes feature importance (see the sketch below)
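A self-contained sketch of that effect. The two neighbor points are invented for illustration; only the min/max values come from the feature ranges quoted above:

```rust
// Euclidean distance between two feature vectors.
fn euclidean(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>().sqrt()
}

// [0, 1] min-max mapping with known feature bounds.
fn scale(v: f32, min: f32, max: f32) -> f32 {
    (v - min) / (max - min)
}

fn main() {
    let test = [70.0_f32, 33.0]; // salary ($k), age
    let n1 = [71.0_f32, 25.0];   // similar salary, 8 years younger
    let n2 = [60.0_f32, 33.0];   // $10k less, same age

    // Unscaled: salary differences dominate, so n1 looks closer.
    println!("{:.2} vs {:.2}", euclidean(&test, &n1), euclidean(&test, &n2));
    // 8.06 vs 10.00

    // Min-max scaled: age now carries comparable weight, and n2 is closer.
    let s = |p: &[f32]| [scale(p[0], 50.0, 95.0), scale(p[1], 25.0, 42.0)];
    println!("{:.2} vs {:.2}",
        euclidean(&s(&test), &s(&n1)),
        euclidean(&s(&test), &s(&n2)));
    // 0.47 vs 0.22
}
```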
### Example 5: Custom Range Scaling
Demonstrates MinMaxScaler with custom ranges:
```rust
let scaler = MinMaxScaler::new().with_range(-1.0, 1.0);
```
Common use cases:
- `[-1, 1]`: Neural networks with tanh activation
- `[0, 1]`: Probabilities, image pixels (the standard default)
- `[0, 255]`: 8-bit image processing
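For a target range [lo, hi], the mapping generalizes to x' = lo + (x − min) · (hi − lo) / (max − min); a quick sketch, assuming `with_range` follows this conventional definition:

```rust
// General min-max mapping onto an arbitrary [lo, hi] target range.
fn scale_to_range(values: &[f32], lo: f32, hi: f32) -> Vec<f32> {
    let min = values.iter().copied().fold(f32::INFINITY, f32::min);
    let max = values.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    values
        .iter()
        .map(|x| lo + (x - min) * (hi - lo) / (max - min))
        .collect()
}

fn main() {
    println!("{:?}", scale_to_range(&[10.0, 20.0, 30.0, 40.0, 50.0], -1.0, 1.0));
    // [-1.0, -0.5, 0.0, 0.5, 1.0]
}
```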
### Example 6: Inverse Transformation
Shows how to recover original scale after scaling:
```rust
let scaled = scaler.fit_transform(&original).unwrap();
let recovered = scaler.inverse_transform(&scaled).unwrap();
// recovered == original (within floating point precision)
```
When to use:
- Interpreting model coefficients in original units
- Presenting predictions to end users
- Visualizing scaled data
- Debugging transformations
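Algebraically the inverses are straightforward; a sketch for the [0, 1] MinMax case (StandardScaler inverts as x = z · σ + μ):

```rust
// Inverse of the [0, 1] min-max mapping: x = x' * (max - min) + min.
// min and max are the statistics the scaler stored during fit().
fn min_max_inverse(scaled: &[f32], min: f32, max: f32) -> Vec<f32> {
    scaled.iter().map(|x| x * (max - min) + min).collect()
}

fn main() {
    let recovered = min_max_inverse(&[0.0, 0.25, 0.5, 0.75, 1.0], 10.0, 50.0);
    println!("{recovered:?}"); // [10.0, 20.0, 30.0, 40.0, 50.0]
}
```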
## Best Practices
### 1. Fit Only on Training Data
```rust
// ✅ Correct
let mut scaler = StandardScaler::new();
scaler.fit(&x_train).unwrap(); // Fit on training data
let x_train_scaled = scaler.transform(&x_train).unwrap();
let x_test_scaled = scaler.transform(&x_test).unwrap(); // Same scaler on test

// ❌ Incorrect (data leakage!)
scaler.fit(&x_test).unwrap(); // Never fit on test data
```
### 2. Use `fit_transform()` for Convenience
```rust
// Shortcut for training data
let x_train_scaled = scaler.fit_transform(&x_train).unwrap();

// Equivalent to:
scaler.fit(&x_train).unwrap();
let x_train_scaled = scaler.transform(&x_train).unwrap();
```
### 3. Save the Scaler with the Model
The scaler is part of your model pipeline and must be saved/loaded with the model to ensure consistent preprocessing at prediction time.
### 4. Check if the Scaler is Fitted
```rust
if scaler.is_fitted() {
    // Safe to transform
}
```
## Decision Guide
Choose StandardScaler when:
- ✅ Data is approximately normally distributed
- ✅ Outliers are present
- ✅ Using linear models, SVM, neural networks
- ✅ Want interpretable z-scores
Choose MinMaxScaler when:
- ✅ Need specific output range
- ✅ No outliers present
- ✅ Data not normally distributed
- ✅ Using image data
- ✅ Want to preserve zero values
- ✅ Using algorithms that require specific range (e.g., sigmoid activation)
Don't Scale when:
- ❌ Using tree-based methods (Decision Trees, Random Forests, GBM)
- ❌ Features already on same scale
- ❌ Scale carries semantic meaning (e.g., age, count data)
## Implementation Details
Both scalers implement the `Transformer` trait with these methods:

- `fit(x)` - Compute statistics from the data
- `transform(x)` - Apply the transformation
- `fit_transform(x)` - Fit, then transform
- `inverse_transform(x)` - Reverse the transformation
Both scalers:
- Work with `Matrix<f32>` from aprender primitives
- Store statistics (mean/std or min/max) per feature
- Support the builder pattern for configuration
- Return `Result` for error handling
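For reference, a plausible shape of that trait, reconstructed from the method list above; the actual signatures in aprender (error type, mutability, module path) may differ:

```rust
use aprender::primitives::Matrix; // path assumed for illustration

// Hypothetical reconstruction of the Transformer trait. Only the
// method names are documented above; the signatures are guesses,
// and String stands in for whatever error type the crate uses.
pub trait Transformer {
    fn fit(&mut self, x: &Matrix<f32>) -> Result<(), String>;
    fn transform(&self, x: &Matrix<f32>) -> Result<Matrix<f32>, String>;
    fn inverse_transform(&self, x: &Matrix<f32>) -> Result<Matrix<f32>, String>;

    // Default implementation: fit, then transform the same data.
    fn fit_transform(&mut self, x: &Matrix<f32>) -> Result<Matrix<f32>, String> {
        self.fit(x)?;
        self.transform(x)
    }
}
```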
## Common Pitfalls
- Fitting on test data: Always fit scaler on training data only
- Forgetting to scale test data: Must apply same transformation to test set
- Using the wrong scaler: MinMaxScaler is sensitive to outliers
- Over-scaling: don't scale inputs to tree-based models
- Losing the scaler: Save scaler with model for production use
## Related Examples
- K-Nearest Neighbors - Distance-based classification
- Descriptive Statistics - Computing mean and std
- Linear Regression - Model that benefits from scaling
## Key Takeaways
- Feature scaling is essential for distance-based and gradient-based algorithms
- StandardScaler is more robust to outliers than MinMaxScaler and preserves relative distances
- MinMaxScaler gives exact range control but is outlier-sensitive
- Always fit on training data and transform both train and test sets
- Save scalers with models for consistent production predictions
- Tree-based models don't need scaling - they're scale-invariant
- Use inverse_transform() to interpret results in original units