Feature Scaling Theory
Feature scaling is a critical preprocessing step that transforms features to similar scales. Proper scaling dramatically improves convergence speed and model performance, especially for distance-based algorithms and gradient descent optimization.
Why Feature Scaling Matters
Problem: Features on Different Scales
Consider a dataset with two features:
Feature 1 (salary): [30,000, 50,000, 80,000, 120,000] (range: 90,000)
Feature 2 (age): [25, 30, 35, 40] (range: 15)
Issue: The salary range (90,000) is 6,000x larger than the age range (15), so salary dominates any computation that mixes the two features!
Impact on Machine Learning Algorithms
1. Gradient Descent
Without scaling, loss surface becomes elongated:
Unscaled Loss Surface:

      θ₁ (salary coefficient)
       ↑
  1000 ┤ ●
   800 ┤  ●
   600 ┤   ●
   400 ┤    ●                 ← Very elongated
   200 ┤     ●●●●●●●●●●●●●●●●●
     0 └────────────────────────→
                                θ₂ (age coefficient)
Problem: The loss is far steeper along θ₁ (salary) than along θ₂ (age), so a learning
rate small enough to stay stable on the steep axis leaves gradient descent crawling
along the flat axis → zig-zagging, slow convergence
With scaling, loss surface becomes circular:
Scaled Loss Surface:

      θ₁
       ↑
   1.0 ┤
   0.8 ┤    ●●●
   0.6 ┤   ●   ●      ← Circular contours
   0.4 ┤   ● ✖ ●        (✖ = optimal)
   0.2 ┤   ●   ●
   0.0 └────●●●─────→
                     θ₂
Result: Gradient descent takes a direct, efficient path to the minimum
Convergence speed: Scaling can improve training time by 10-100x!
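The elongation can be quantified. For least-squares regression the loss contours are shaped by the feature Gram matrix XᵀX/N, and the number of gradient descent iterations needed grows roughly in proportion to its condition number κ = λmax/λmin. Below is a minimal plain-Rust sketch (no aprender types; the salary/age numbers are the toy values above) comparing κ before and after standardization:

fn gram_2x2(x1: &[f64], x2: &[f64]) -> [[f64; 2]; 2] {
    // Gram matrix XᵀX / N for two feature columns
    let n = x1.len() as f64;
    let dot = |a: &[f64], b: &[f64]| a.iter().zip(b).map(|(u, v)| u * v).sum::<f64>() / n;
    [[dot(x1, x1), dot(x1, x2)],
     [dot(x1, x2), dot(x2, x2)]]
}

fn condition_number(m: [[f64; 2]; 2]) -> f64 {
    // Eigenvalues of a symmetric 2x2 matrix via the quadratic formula
    let tr = m[0][0] + m[1][1];
    let det = m[0][0] * m[1][1] - m[0][1] * m[1][0];
    let disc = (tr * tr / 4.0 - det).sqrt();
    (tr / 2.0 + disc) / (tr / 2.0 - disc)
}

fn standardize(x: &[f64]) -> Vec<f64> {
    let n = x.len() as f64;
    let mean = x.iter().sum::<f64>() / n;
    let std = (x.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n).sqrt();
    x.iter().map(|v| (v - mean) / std).collect()
}

fn main() {
    let salary = [30_000.0_f64, 50_000.0, 80_000.0, 120_000.0];
    let age = [25.0_f64, 30.0, 35.0, 40.0];

    let kappa_raw = condition_number(gram_2x2(&salary, &age));
    let kappa_scaled = condition_number(gram_2x2(&standardize(&salary), &standardize(&age)));

    // After standardization κ is several orders of magnitude smaller; the conditioning
    // that remains comes from salary and age being nearly collinear in this toy data,
    // not from their units.
    println!("condition number (raw):    {kappa_raw:.3e}");
    println!("condition number (scaled): {kappa_scaled:.1}");
}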
2. Distance-Based Algorithms (K-NN, K-Means, SVM)
Euclidean distance formula:
d = √((x₁-y₁)² + (x₂-y₂)²)
With unscaled features:
Sample A: (salary=50000, age=30)
Sample B: (salary=51000, age=35)
Distance = √((51000-50000)² + (35-30)²)
= √(1000² + 5²)
= √(1,000,000 + 25)
= √1,000,025
≈ 1000.01
Contribution to squared distance:
Salary: 1,000,000 / 1,000,025 ≈ 99.997%
Age: 25 / 1,000,025 ≈ 0.003%
Problem: Age is completely ignored! K-NN makes decisions based solely on salary.
With min-max scaled features (both mapped to [0, 1] using the ranges above: salary over [30,000, 120,000], age over [25, 40]):
Scaled A: (salary=0.22, age=0.33)
Scaled B: (salary=0.23, age=0.67)
Distance = √((0.23-0.22)² + (0.67-0.33)²)
= √(0.0001 + 0.1156)
= √0.1157
≈ 0.34
Contribution to squared distance:
Salary: 0.0001 / 0.1157 ≈ 0.1%
Age: 0.1156 / 0.1157 ≈ 99.9%
Result: Distances now reflect relative differences rather than raw units: the 1,000 salary gap is about 1% of the salary range, while the 5-year age gap is a third of the age range, so age is no longer ignored.
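To make the dominance concrete, here is a small plain-Rust sketch (no aprender types; the numbers are the toy samples above) that prints each feature's share of the squared distance before and after scaling:

fn squared_dist_terms(a: (f64, f64), b: (f64, f64)) -> (f64, f64) {
    // Per-feature contributions to the squared Euclidean distance
    ((a.0 - b.0).powi(2), (a.1 - b.1).powi(2))
}

fn report(label: &str, a: (f64, f64), b: (f64, f64)) {
    let (salary_term, age_term) = squared_dist_terms(a, b);
    let total = salary_term + age_term;
    println!(
        "{label}: distance = {:.3}, salary share = {:.1}%, age share = {:.1}%",
        total.sqrt(),
        100.0 * salary_term / total,
        100.0 * age_term / total
    );
}

fn main() {
    // Raw units: the salary term swamps the age term
    report("unscaled", (50_000.0, 30.0), (51_000.0, 35.0));
    // Min-max scaled to [0, 1] using the ranges from the example
    report("scaled  ", (0.222, 0.333), (0.233, 0.667));
}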
Scaling Methods
Comparison Table
| Method | Formula | Range | Best For | Outlier Sensitivity |
|---|---|---|---|---|
| StandardScaler | (x - μ) / σ | Unbounded, typically ~[-3, 3] | Approximately normal distributions | Moderate |
| MinMaxScaler | (x - min) / (max - min) | [0, 1] or custom | Known bounds needed | High |
| RobustScaler | (x - median) / IQR | Unbounded | Data with outliers | Low |
| MaxAbsScaler | x / max(\|x\|) | [-1, 1] | Sparse data, preserves zeros | High |
| Normalization (L2) | x / ‖x‖₂ | Unit sphere | Text, TF-IDF vectors | N/A |
StandardScaler: Z-Score Normalization
Key idea: Center data at zero, scale by standard deviation.
Formula
x' = (x - μ) / σ
Where:
μ = mean of feature
σ = standard deviation of feature
Properties
After standardization:
- Mean = 0
- Standard deviation = 1
- Distribution shape preserved
Algorithm
1. Fit phase (training data):
μ = (1/N) Σ xᵢ // Compute mean
σ = √[(1/N) Σ (xᵢ - μ)²] // Compute std
2. Transform phase:
x'ᵢ = (xᵢ - μ) / σ // Scale each sample
3. Inverse transform (optional):
xᵢ = x'ᵢ × σ + μ // Recover original scale
Example
Original data: [1, 2, 3, 4, 5]
Step 1: Compute statistics
μ = (1+2+3+4+5) / 5 = 3
σ = √[((1-3)² + (2-3)² + (3-3)² + (4-3)² + (5-3)²) / 5]
= √[(4 + 1 + 0 + 1 + 4) / 5]
= √2 ≈ 1.414
Step 2: Transform
x'₁ = (1 - 3) / 1.414 = -1.414
x'₂ = (2 - 3) / 1.414 = -0.707
x'₃ = (3 - 3) / 1.414 = 0.000
x'₄ = (4 - 3) / 1.414 = 0.707
x'₅ = (5 - 3) / 1.414 = 1.414
Result: [-1.414, -0.707, 0.000, 0.707, 1.414]
Mean = 0, Std = 1 ✓
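The same computation can be checked from scratch in a few lines of plain Rust (using the population standard deviation, matching the formula above; this is not the aprender API):

fn standardize(x: &[f64]) -> (Vec<f64>, f64, f64) {
    let n = x.len() as f64;
    let mean = x.iter().sum::<f64>() / n;
    // Population standard deviation: divide by N, as in the algorithm above
    let std = (x.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n).sqrt();
    let scaled = x.iter().map(|v| (v - mean) / std).collect();
    (scaled, mean, std)
}

fn main() {
    let (scaled, mean, std) = standardize(&[1.0, 2.0, 3.0, 4.0, 5.0]);
    println!("mean = {mean}, std = {std:.3}"); // mean = 3, std = 1.414
    println!("{scaled:.3?}");                  // [-1.414, -0.707, 0.000, 0.707, 1.414]
}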
aprender Implementation
use aprender::preprocessing::StandardScaler;
use aprender::primitives::Matrix;
// Create scaler
let mut scaler = StandardScaler::new();
// Fit on training data
scaler.fit(&x_train)?;
// Transform training and test data
let x_train_scaled = scaler.transform(&x_train)?;
let x_test_scaled = scaler.transform(&x_test)?;
// Access learned statistics
println!("Mean: {:?}", scaler.mean());
println!("Std: {:?}", scaler.std());
// Inverse transform (recover original scale)
let x_recovered = scaler.inverse_transform(&x_train_scaled)?;
Advantages
- More robust to outliers than MinMaxScaler: extreme values shift the mean/std less than they shift the min/max
- Maintains distribution shape: Useful for normally distributed data
- Unbounded output: Can handle values outside training range
- Interpretable: "How many standard deviations from the mean?"
Disadvantages
- Assumes normality: Less effective for heavily skewed distributions
- Unbounded range: Output not in [0, 1] if that's required
- Still affected by outliers: mean and std remain sensitive to extreme values
When to Use
✅ Use StandardScaler for:
- Features with approximately normal distribution
- Gradient-based optimization (neural networks, logistic regression)
- SVM with RBF kernel
- PCA (Principal Component Analysis)
- Data with moderate outliers
❌ Avoid StandardScaler for:
- Need strict [0, 1] bounds (use MinMaxScaler)
- Heavy outliers (use RobustScaler)
- Sparse data with many zeros (use MaxAbsScaler)
MinMaxScaler: Range Normalization
Key idea: Scale features to a fixed range, typically [0, 1].
Formula
x' = (x - min) / (max - min) // Scale to [0, 1]
x' = a + (x - min) × (b - a) / (max - min) // Scale to [a, b]
Properties
After min-max scaling to [0, 1]:
- Minimum value → 0
- Maximum value → 1
- Linear transformation (preserves relationships)
Algorithm
1. Fit phase (training data):
min = minimum value in feature
max = maximum value in feature
range = max - min
2. Transform phase:
x'ᵢ = (xᵢ - min) / range
3. Inverse transform:
xᵢ = x'ᵢ × range + min
Example
Original data: [10, 20, 30, 40, 50]
Step 1: Compute range
min = 10
max = 50
range = 50 - 10 = 40
Step 2: Transform to [0, 1]
x'₁ = (10 - 10) / 40 = 0.00
x'₂ = (20 - 10) / 40 = 0.25
x'₃ = (30 - 10) / 40 = 0.50
x'₄ = (40 - 10) / 40 = 0.75
x'₅ = (50 - 10) / 40 = 1.00
Result: [0.00, 0.25, 0.50, 0.75, 1.00]
Min = 0, Max = 1 ✓
Custom Range Example
Scale to [-1, 1] for neural networks with tanh activation:
Original: [10, 20, 30, 40, 50]
Range: [min=10, max=50]
Formula: x' = -1 + (x - 10) × 2 / 40
Result:
10 → -1.0
20 → -0.5
30 → 0.0
40 → 0.5
50 → 1.0
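A from-scratch check of both the [0, 1] and the [-1, 1] mappings in plain Rust (not the aprender API):

fn min_max_scale(x: &[f64], lo: f64, hi: f64) -> Vec<f64> {
    let min = x.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = x.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let range = max - min;
    // Linear map sending min -> lo and max -> hi
    x.iter().map(|v| lo + (v - min) * (hi - lo) / range).collect()
}

fn main() {
    let data = [10.0, 20.0, 30.0, 40.0, 50.0];
    println!("{:?}", min_max_scale(&data, 0.0, 1.0));  // [0.0, 0.25, 0.5, 0.75, 1.0]
    println!("{:?}", min_max_scale(&data, -1.0, 1.0)); // [-1.0, -0.5, 0.0, 0.5, 1.0]
}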
aprender Implementation
use aprender::preprocessing::MinMaxScaler;
// Scale to [0, 1] (default)
let mut scaler = MinMaxScaler::new();
// Scale to custom range [-1, 1]
let mut scaler = MinMaxScaler::new()
.with_range(-1.0, 1.0);
// Fit and transform
scaler.fit(&x_train)?;
let x_train_scaled = scaler.transform(&x_train)?;
let x_test_scaled = scaler.transform(&x_test)?;
// Access learned parameters
println!("Data min: {:?}", scaler.data_min());
println!("Data max: {:?}", scaler.data_max());
// Inverse transform
let x_recovered = scaler.inverse_transform(&x_train_scaled)?;
Advantages
- Bounded output: Guaranteed range [0, 1] or custom
- Preserves zeros when the feature minimum is 0 (e.g., counts, pixel intensities)
- Interpretable: "What percentage of the range?"
- No assumptions: Works with any distribution
Disadvantages
- Sensitive to outliers: Single extreme value affects entire scaling
- Bounded by training data: Test values outside [train_min, train_max] → outside [0, 1]
- Distorts distribution: Outliers compress main data range
When to Use
✅ Use MinMaxScaler for:
- Neural networks with sigmoid/tanh activation
- Bounded features needed (e.g., image pixels)
- No outliers present
- Features with known bounds
- When interpretability as "percentage" is useful
❌ Avoid MinMaxScaler for:
- Data with outliers (they compress everything else)
- Test data may have values outside training range
- Need to preserve distribution shape
Outlier Handling Comparison
Dataset with Outlier
Data: [1, 2, 3, 4, 5, 100] ← 100 is an outlier
StandardScaler (Less Affected)
μ = (1+2+3+4+5+100) / 6 ≈ 19.17
σ ≈ 36.17 (population std)
Scaled:
1 → (1-19.17)/36.17 ≈ -0.50
2 → (2-19.17)/36.17 ≈ -0.47
3 → (3-19.17)/36.17 ≈ -0.45
4 → (4-19.17)/36.17 ≈ -0.42
5 → (5-19.17)/36.17 ≈ -0.39
100 → (100-19.17)/36.17 ≈ 2.23
Main data: [-0.50 to -0.39] (range ≈ 0.11)
Outlier: 2.23
Effect: Outlier shifted but main data still usable, relatively compressed.
MinMaxScaler (Heavily Affected)
min = 1, max = 100, range = 99
Scaled:
1 → (1-1)/99 = 0.000
2 → (2-1)/99 = 0.010
3 → (3-1)/99 = 0.020
4 → (4-1)/99 = 0.030
5 → (5-1)/99 = 0.040
100 → (100-1)/99 = 1.000
Main data: [0.000 to 0.040] (compressed to 4% of range!)
Outlier: 1.000
Effect: Outlier uses 96% of range, main data compressed to tiny interval.
Lesson: Use StandardScaler or RobustScaler when outliers are present!
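A small plain-Rust sketch (not the aprender API) that reproduces both scalings on this data and shows the compression directly:

fn main() {
    let data = [1.0_f64, 2.0, 3.0, 4.0, 5.0, 100.0];
    let n = data.len() as f64;

    // StandardScaler: z-scores with the population standard deviation
    let mean = data.iter().sum::<f64>() / n;
    let std = (data.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n).sqrt();
    let z: Vec<f64> = data.iter().map(|v| (v - mean) / std).collect();

    // MinMaxScaler: map min -> 0 and max -> 1 (min/max of this toy data)
    let (min, max) = (1.0, 100.0);
    let mm: Vec<f64> = data.iter().map(|v| (v - min) / (max - min)).collect();

    // The inliers span ~0.11 after standardization but only ~0.04 after min-max
    println!("z-scores: {z:.2?}");
    println!("min-max : {mm:.2?}");
}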
When to Scale Features
Algorithms That REQUIRE Scaling
These algorithms fail or perform poorly without scaling:
| Algorithm | Why Scaling Needed |
|---|---|
| K-Nearest Neighbors | Distance calculation dominated by large-scale features |
| K-Means Clustering | Centroid calculation uses Euclidean distance |
| Support Vector Machines | Distance to hyperplane affected by feature scales |
| Principal Component Analysis | Variance calculation dominated by large-scale features |
| Gradient Descent | Elongated loss surface causes slow convergence |
| Neural Networks | Weights initialized for similar input scales |
| Logistic Regression | Gradient descent convergence issues |
Algorithms That DON'T Need Scaling
These algorithms are scale-invariant:
| Algorithm | Why Scaling Not Needed |
|---|---|
| Decision Trees | Splits based on thresholds, not distances |
| Random Forests | Ensemble of decision trees |
| Gradient Boosting | Based on decision trees |
| Naive Bayes | Works with probability distributions |
Exception: Even for tree-based models, scaling does no harm, and it becomes relevant again when trees are combined with scale-sensitive models (e.g., stacking with K-NN or a linear model in the same pipeline).
Critical Workflow Rules
Rule 1: Fit on Training Data ONLY
// ❌ WRONG: Fitting on all data leaks information
scaler.fit(&x_all)?;
let x_train_scaled = scaler.transform(&x_train)?;
let x_test_scaled = scaler.transform(&x_test)?;
// ✅ CORRECT: Fit only on training data
scaler.fit(&x_train)?; // Learn μ, σ from training only
let x_train_scaled = scaler.transform(&x_train)?;
let x_test_scaled = scaler.transform(&x_test)?; // Apply same μ, σ
Why? Fitting on test data creates data leakage:
- Test set statistics influence scaling
- Model indirectly "sees" test data during training
- Overly optimistic performance estimates
- Fails in production (new data has different statistics)
Rule 2: Same Scaler for Train and Test
// ❌ WRONG: Different scalers
let mut train_scaler = StandardScaler::new();
train_scaler.fit(&x_train)?;
let x_train_scaled = train_scaler.transform(&x_train)?;
let mut test_scaler = StandardScaler::new();
test_scaler.fit(&x_test)?; // ← WRONG! Different statistics
let x_test_scaled = test_scaler.transform(&x_test)?;
// ✅ CORRECT: Same scaler
let mut scaler = StandardScaler::new();
scaler.fit(&x_train)?;
let x_train_scaled = scaler.transform(&x_train)?;
let x_test_scaled = scaler.transform(&x_test)?; // Same statistics
Rule 3: Scale Before Splitting? NO!
// ❌ WRONG: Scale before train/test split
scaler.fit(&x_all)?;
let x_scaled = scaler.transform(&x_all)?;
let (x_train, x_test, ...) = train_test_split(&x_scaled, ...)?;
// ✅ CORRECT: Split before scaling
let (x_train, x_test, ...) = train_test_split(&x, ...)?;
scaler.fit(&x_train)?;
let x_train_scaled = scaler.transform(&x_train)?;
let x_test_scaled = scaler.transform(&x_test)?;
Rule 4: Save Scaler for Production
// Training phase
let mut scaler = StandardScaler::new();
scaler.fit(&x_train)?;
// Save scaler parameters (ScalerParams, save_to_disk, load_from_disk, and
// from_params below are illustrative helpers sketching the workflow, not
// guaranteed aprender APIs)
let scaler_params = ScalerParams {
mean: scaler.mean().clone(),
std: scaler.std().clone(),
};
save_to_disk(&scaler_params, "scaler.json")?;
// Production phase (months later)
let scaler_params = load_from_disk("scaler.json")?;
let mut scaler = StandardScaler::from_params(scaler_params);
let x_new_scaled = scaler.transform(&x_new)?;
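If aprender does not ship its own persistence helpers, the learned statistics can be stored with standard serialization crates. Below is a minimal sketch assuming serde and serde_json as dependencies; ScalerParams is a hand-rolled struct, not an aprender type, and how the loaded values are pushed back into a StandardScaler depends on the library's API:

use serde::{Deserialize, Serialize};
use std::fs;

// Hand-rolled container for the learned statistics
#[derive(Serialize, Deserialize)]
struct ScalerParams {
    mean: Vec<f32>,
    std: Vec<f32>,
}

fn save_params(params: &ScalerParams, path: &str) -> std::io::Result<()> {
    // Persist the statistics as JSON so production code can reload them later
    let json = serde_json::to_string_pretty(params).expect("serialization failed");
    fs::write(path, json)
}

fn load_params(path: &str) -> std::io::Result<ScalerParams> {
    let json = fs::read_to_string(path)?;
    Ok(serde_json::from_str(&json).expect("deserialization failed"))
}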
Feature-Specific Scaling Strategies
Numerical Features
Continuous variables (age, salary, temperature):
- StandardScaler if approximately normal
- MinMaxScaler if bounded and no outliers
- RobustScaler if outliers present
Binary Features (0/1)
No scaling needed!
Original: [0, 1, 0, 1, 1] ← Already in [0, 1]
Don't scale: Breaks semantic meaning (presence/absence)
Count Features
Examples: Number of purchases, page visits, words in document
Strategy: Consider log transformation first, then scale
// Apply log(1 + x) transform (the +1 handles zero counts)
let x_log: Vec<f32> = x.iter()
    .map(|&count| (count + 1.0).ln())
    .collect();
// Then scale as usual (shown for a single count column; in practice place the
// transformed column back into the feature Matrix before fitting)
scaler.fit(&x_log)?;
let x_scaled = scaler.transform(&x_log)?;
Categorical Features (Encoded)
One-hot encoded: No scaling needed (already 0/1)
Label encoded (ordinal): Scale if using distance-based algorithms
Impact on Model Performance
Example: K-NN on Employee Data
Dataset:
Feature 1: Salary [30k-120k]
Feature 2: Age [25-40]
Feature 3: Years of experience [1-15]
Task: Predict employee attrition
Without scaling:
K-NN accuracy: 62%
(Salary dominates distance calculation)
With StandardScaler:
K-NN accuracy: 84%
(All features contribute meaningfully)
Improvement: +22 percentage points! ✅
Example: Neural Network Convergence
Network: 3-layer MLP
Dataset: Mixed-scale features
Without scaling:
Epochs to converge: 500
Training time: 45 seconds
With StandardScaler:
Epochs to converge: 50
Training time: 5 seconds
Speedup: 9x faster! ✅
Decision Guide
Flowchart: Which Scaler?
Start
│
├─ Are there outliers?
│ ├─ YES → RobustScaler
│ └─ NO → Continue
│
├─ Need bounded range [0,1]?
│ ├─ YES → MinMaxScaler
│ └─ NO → Continue
│
├─ Is data approximately normal?
│ ├─ YES → StandardScaler ✓ (default choice)
│ └─ NO → Continue
│
├─ Is data sparse (many zeros)?
│ ├─ YES → MaxAbsScaler
│ └─ NO → StandardScaler
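The same decision logic can be written down as a tiny helper, which is handy in automated pipelines; a sketch (the function and its flags are illustrative, not part of aprender):

// Encode the flowchart above as a function returning the recommended scaler name
fn choose_scaler(
    has_outliers: bool,
    needs_bounded_range: bool,
    roughly_normal: bool,
    is_sparse: bool,
) -> &'static str {
    if has_outliers {
        "RobustScaler"
    } else if needs_bounded_range {
        "MinMaxScaler"
    } else if roughly_normal || !is_sparse {
        "StandardScaler" // default choice
    } else {
        "MaxAbsScaler"
    }
}

fn main() {
    // e.g., salary data with a few extreme values
    println!("{}", choose_scaler(true, false, false, false)); // RobustScaler
}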
Quick Reference
| Your Situation | Recommended Scaler |
|---|---|
| Default choice, unsure | StandardScaler |
| Neural networks | StandardScaler or MinMaxScaler |
| K-NN, K-Means, SVM | StandardScaler |
| Data has outliers | RobustScaler |
| Need [0,1] bounds | MinMaxScaler |
| Sparse data | MaxAbsScaler |
| Tree-based models | No scaling (optional) |
Common Mistakes
Mistake 1: Forgetting to Scale Test Data
// ❌ WRONG
scaler.fit(&x_train)?;
let x_train_scaled = scaler.transform(&x_train)?;
// ... train model on x_train_scaled ...
let predictions = model.predict(&x_test)?; // ← Unscaled!
Result: Model sees different scale at test time, terrible performance.
Mistake 2: Scaling Target Variable Unnecessarily
// ❌ Usually unnecessary for regression targets
scaler_y.fit(&y_train)?;
let y_train_scaled = scaler_y.transform(&y_train)?;
When needed: Only if the target has an extreme range (e.g., house prices in the millions) and the optimizer struggles
Better solution: Log-transform the target (and invert the transform on predictions, as sketched below), or keep the target scaler and inverse-transform predictions back to the original units
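A minimal sketch of the log-transform alternative in plain Rust (helper names are illustrative): train on ln(target), then exponentiate predictions back to the original units.

fn log_transform_target(y: &[f32]) -> Vec<f32> {
    // ln is only valid for strictly positive targets (true for prices)
    y.iter().map(|&price| price.ln()).collect()
}

fn invert_log_predictions(y_pred_log: &[f32]) -> Vec<f32> {
    // Undo the log transform so predictions are back in original units
    y_pred_log.iter().map(|&p| p.exp()).collect()
}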
Mistake 3: Scaling Categorical Encoded Features
// One-hot encoded: [1, 0, 0] for category A
// [0, 1, 0] for category B
// ❌ WRONG: Scaling destroys categorical meaning
scaler.fit(&one_hot_encoded)?;
Correct: Don't scale one-hot encoded features!
aprender Example: Complete Pipeline
use aprender::preprocessing::StandardScaler;
use aprender::classification::KNearestNeighbors;
use aprender::model_selection::train_test_split;
use aprender::prelude::*;
fn full_pipeline_example(x: &Matrix<f32>, y: &Vec<i32>) -> Result<f32> {
// 1. Split data FIRST
let (x_train, x_test, y_train, y_test) =
train_test_split(x, y, 0.2, Some(42))?;
// 2. Create and fit scaler on training data ONLY
let mut scaler = StandardScaler::new();
scaler.fit(&x_train)?;
// 3. Transform both train and test using same scaler
let x_train_scaled = scaler.transform(&x_train)?;
let x_test_scaled = scaler.transform(&x_test)?;
// 4. Train model on scaled data
let mut model = KNearestNeighbors::new(5);
model.fit(&x_train_scaled, &y_train)?;
// 5. Evaluate on scaled test data
let accuracy = model.score(&x_test_scaled, &y_test);
println!("Learned scaling parameters:");
println!(" Mean: {:?}", scaler.mean());
println!(" Std: {:?}", scaler.std());
println!("\nTest accuracy: {:.4}", accuracy);
Ok(accuracy)
}
Further Reading
Theory:
- Standardization: Common practice in statistics since 1950s
- Min-Max Scaling: Standard normalization technique
Practical:
- sklearn documentation: Detailed scaler comparisons
- "Feature Engineering for Machine Learning" (Zheng & Casari)
Related Chapters
- Data Preprocessing with Scalers - Hands-on examples
- K-NN Iris Example - Scaling impact on K-NN
- Gradient Descent Theory - Why scaling accelerates optimization
Summary
| Concept | Key Takeaway |
|---|---|
| Why scale? | Distance-based algorithms and gradient descent need similar feature scales |
| StandardScaler | Default choice: centers at 0, scales by std dev |
| MinMaxScaler | When bounded [0,1] range needed, no outliers |
| Fit on training | CRITICAL: Only fit scaler on training data, apply to test |
| Algorithms needing scaling | K-NN, K-Means, SVM, Neural Networks, PCA |
| Algorithms NOT needing scaling | Decision Trees, Random Forests, Naive Bayes |
| Performance impact | Can improve accuracy by 20%+ and speed by 10-100x |
Feature scaling is often the single most important preprocessing step in machine learning pipelines. Proper scaling can mean the difference between a model that fails to converge and one that achieves state-of-the-art performance.