Case Study: KNN Iris

This case study demonstrates K-Nearest Neighbors (kNN) classification on the Iris dataset, exploring the effects of k values, distance metrics, and voting strategies to achieve 90% test accuracy.

Overview

We'll apply kNN to Iris flower data to:

Classify three species (Setosa, Versicolor, Virginica)
Explore the effect of k parameter (1, 3, 5, 7, 9)
Compare distance metrics (Euclidean, Manhattan, Minkowski)
Analyze weighted vs uniform voting
Generate probabilistic predictions with confidence scores

Running the Example

cargo run --example knn_iris

Expected output: Comprehensive kNN analysis including accuracy for different k values, distance metric comparison, voting strategy comparison, and probabilistic predictions with confidence scores.

Dataset

Iris Flower Measurements

// Features: [sepal_length, sepal_width, petal_length, petal_width]
// Classes: 0=Setosa, 1=Versicolor, 2=Virginica

// Training set: 20 samples (7 Setosa, 7 Versicolor, 6 Virginica)
let x_train = Matrix::from_vec(20, 4, vec![
    // Setosa (small petals, wide sepals)
    5.1, 3.5, 1.4, 0.2,
    4.9, 3.0, 1.4, 0.2,
    ...
    // Versicolor (medium petals and sepals)
    7.0, 3.2, 4.7, 1.4,
    6.4, 3.2, 4.5, 1.5,
    ...
    // Virginica (large petals and sepals)
    6.3, 3.3, 6.0, 2.5,
    5.8, 2.7, 5.1, 1.9,
    ...
])?;

// Test set: 10 samples (3 Setosa, 3 Versicolor, 4 Virginica)

Dataset characteristics:

20 training samples (67% of 30-sample dataset)
10 test samples (33% of dataset)
4 continuous features (all in centimeters)
3 well-separated species classes
Near-balanced class distribution in training set (7/7/6)

Part 1: Basic kNN (k=3)

Implementation

use aprender::classification::KNearestNeighbors;
use aprender::primitives::Matrix;

let mut knn = KNearestNeighbors::new(3);
knn.fit(&x_train, &y_train)?;

let predictions = knn.predict(&x_test)?;
let accuracy = compute_accuracy(&predictions, &y_test);
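
The `compute_accuracy` helper is called above but not shown. A minimal sketch follows, assuming predictions and labels are class indices stored as `usize` and that the helper returns a fraction in [0, 1]; the real helper in the example may differ.

```rust
/// Fraction of predictions that match the true labels.
/// Assumption: labels are class indices stored as `usize`.
fn compute_accuracy(predictions: &[usize], targets: &[usize]) -> f32 {
    assert_eq!(predictions.len(), targets.len(), "length mismatch");
    if targets.is_empty() {
        return 0.0;
    }
    // Count positions where prediction and target agree.
    let correct = predictions
        .iter()
        .zip(targets.iter())
        .filter(|(p, t)| p == t)
        .count();
    correct as f32 / targets.len() as f32
}
```

With 9 of 10 labels matching, this returns 0.9, which the snippets in this chapter multiply by 100.0 for display.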

Results

Test Accuracy: 90.0%

Analysis:

9 out of 10 test samples correctly classified
k=3 provides good balance between bias and variance
Works well even without hyperparameter tuning

Part 2: Effect of k Parameter

Experiment

for k in [1, 3, 5, 7, 9] {
    let mut knn = KNearestNeighbors::new(k);
    knn.fit(&x_train, &y_train)?;
    let predictions = knn.predict(&x_test)?;
    let accuracy = compute_accuracy(&predictions, &y_test);
    println!("k={}: Accuracy = {:.1}%", k, accuracy * 100.0);
}

Results

k=1: Accuracy = 90.0%
k=3: Accuracy = 90.0%
k=5: Accuracy = 80.0%
k=7: Accuracy = 80.0%
k=9: Accuracy = 80.0%

Interpretation

Small k (1-3):

90% accuracy: Best performance on this dataset
k=1 reproduces the training labels perfectly (each training point is its own nearest neighbor)
k=3 balances local patterns with noise reduction
Risk: Overfitting, sensitive to outliers

Large k (5-9):

80% accuracy: Performance degrades
Decision boundaries become smoother
More robust to noise but loses fine distinctions
k=9 uses 45% of training data for each prediction (9/20)
Risk: Underfitting, class boundaries blur

Optimal k:

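For this dataset: k=1 and k=3 tie for best test accuracy (90%); k=3 is the less noise-sensitive of the two
General rule: k ≈ √n = √20 ≈ 4.5, a starting point rather than a guarantee (uniform k=5 scores 80% here)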
Use cross-validation for systematic selection
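
One systematic way to choose k is leave-one-out cross-validation on the training set. The sketch below is self-contained and does not use the aprender API: `knn_predict`, `loo_accuracy`, and `select_k` are hypothetical helpers, rows are plain `Vec<f32>`, voting is uniform, and vote ties go to the lowest class index.

```rust
/// Squared Euclidean distance between two feature vectors.
fn sq_dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Uniform-vote kNN prediction for `query`, optionally skipping one training row.
fn knn_predict(
    x: &[Vec<f32>],
    y: &[usize],
    query: &[f32],
    k: usize,
    skip: Option<usize>,
    n_classes: usize,
) -> usize {
    let mut dists: Vec<(f32, usize)> = x
        .iter()
        .enumerate()
        .filter(|(i, _)| Some(*i) != skip)
        .map(|(i, row)| (sq_dist(row, query), y[i]))
        .collect();
    dists.sort_by(|a, b| a.0.total_cmp(&b.0));

    let mut votes = vec![0usize; n_classes];
    for &(_, label) in dists.iter().take(k) {
        votes[label] += 1;
    }
    // Argmax over classes; ties resolve to the lowest class index.
    let mut best = 0;
    for c in 1..n_classes {
        if votes[c] > votes[best] {
            best = c;
        }
    }
    best
}

/// Leave-one-out accuracy: predict each row from all the other rows.
fn loo_accuracy(x: &[Vec<f32>], y: &[usize], k: usize, n_classes: usize) -> f32 {
    let correct = (0..x.len())
        .filter(|&i| knn_predict(x, y, &x[i], k, Some(i), n_classes) == y[i])
        .count();
    correct as f32 / x.len() as f32
}

/// Candidate k with the highest leave-one-out accuracy (first candidate wins ties).
fn select_k(x: &[Vec<f32>], y: &[usize], candidates: &[usize], n_classes: usize) -> usize {
    let mut best_k = candidates[0];
    let mut best_acc = -1.0_f32;
    for &k in candidates {
        let acc = loo_accuracy(x, y, k, n_classes);
        if acc > best_acc {
            best_acc = acc;
            best_k = k;
        }
    }
    best_k
}
```

Because the selection uses only training rows, the test set stays untouched for the final accuracy estimate.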

Part 3: Distance Metrics (k=5)

Comparison

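use aprender::classification::{DistanceMetric, KNearestNeighbors};

// Same k for every model; only the distance metric changes.
let mut knn_euclidean = KNearestNeighbors::new(5)
    .with_metric(DistanceMetric::Euclidean);

let mut knn_manhattan = KNearestNeighbors::new(5)
    .with_metric(DistanceMetric::Manhattan);

// Minkowski takes its order p as a parameter (here p = 3).
let mut knn_minkowski = KNearestNeighbors::new(5)
    .with_metric(DistanceMetric::Minkowski(3.0));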

Results

Euclidean distance:   80.0%
Manhattan distance:   80.0%
Minkowski (p=3):      80.0%

Interpretation

Identical performance (80%) across all metrics for k=5.

Why:

Iris features (sepal/petal dimensions) are all continuous and similarly scaled
All three metrics capture species differences effectively
Ranking of neighbors is similar across metrics

When metrics differ:

Euclidean: Best for continuous, normally distributed features
Manhattan: Better for count data or when outliers present
Minkowski (p>2): Emphasizes dimensions with largest differences

Recommendation: Use Euclidean (default) for continuous features, Manhattan for robustness to outliers.
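
To make the three metrics concrete, here is a stand-alone sketch of each formula. These free functions are illustrative only and are not the library's `DistanceMetric` implementation; they assume both slices have the same length.

```rust
/// Euclidean (L2) distance: square root of the summed squared differences.
fn euclidean(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).powi(2))
        .sum::<f32>()
        .sqrt()
}

/// Manhattan (L1) distance: sum of absolute differences.
fn manhattan(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
}

/// Minkowski distance of order p: p = 1 gives Manhattan, p = 2 gives Euclidean.
fn minkowski(a: &[f32], b: &[f32], p: f32) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs().powf(p))
        .sum::<f32>()
        .powf(1.0 / p)
}
```

For the points [0, 0] and [3, 4], Manhattan gives 7, Euclidean gives 5, and Minkowski with p=3 gives roughly 4.5. As p grows, the result moves toward the largest single-coordinate difference (4), which is what "emphasizes dimensions with largest differences" means in practice.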

Part 4: Weighted vs Uniform Voting

Comparison

// Uniform voting: all neighbors count equally
let mut knn_uniform = KNearestNeighbors::new(5);
knn_uniform.fit(&x_train, &y_train)?;

// Weighted voting: closer neighbors count more
let mut knn_weighted = KNearestNeighbors::new(5).with_weights(true);
knn_weighted.fit(&x_train, &y_train)?;

Results

Uniform voting:   80.0%
Weighted voting:  90.0%

Interpretation

Weighted voting improves accuracy by 10 percentage points (from 80% to 90%).

Why weighted voting helps:

Gives more influence to closer (more similar) neighbors
Reduces impact of distant outliers in k=5 neighborhood
More intuitive: "very close neighbors matter more"
Weight formula: w_i = 1 / distance_i

Example scenario:

Neighbor distances for test sample:
  Neighbor 1: d=0.2, class=Versicolor, weight=5.0
  Neighbor 2: d=0.3, class=Versicolor, weight=3.3
  Neighbor 3: d=0.5, class=Versicolor, weight=2.0
  Neighbor 4: d=1.8, class=Setosa,     weight=0.56
  Neighbor 5: d=2.0, class=Setosa,     weight=0.50

Uniform: 3 votes Versicolor, 2 votes Setosa → Versicolor (60%)
Weighted: 10.3 weighted votes Versicolor, 1.06 Setosa → Versicolor (91%)

Recommendation: Use weighted voting for k ≥ 5, uniform for k ≤ 3.
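
The arithmetic in the example scenario can be reproduced with a short sketch. The function names are hypothetical and this is not the library's voting code; the clamp on tiny distances is an added assumption to avoid dividing by zero when a query coincides with a training point.

```rust
/// Inverse-distance vote shares per class, normalized to sum to 1.
/// `neighbors` holds (distance, class index) pairs for the k nearest points.
fn weighted_vote_shares(neighbors: &[(f64, usize)], n_classes: usize) -> Vec<f64> {
    assert!(!neighbors.is_empty(), "need at least one neighbor");
    let mut votes = vec![0.0_f64; n_classes];
    for &(distance, class) in neighbors {
        // Assumption: clamp tiny distances so an exact match does not divide by zero.
        votes[class] += 1.0 / distance.max(1e-12);
    }
    let total: f64 = votes.iter().sum();
    votes.iter().map(|v| v / total).collect()
}

/// Uniform vote shares for comparison: every neighbor counts once.
fn uniform_vote_shares(neighbors: &[(f64, usize)], n_classes: usize) -> Vec<f64> {
    assert!(!neighbors.is_empty(), "need at least one neighbor");
    let mut votes = vec![0.0_f64; n_classes];
    for &(_, class) in neighbors {
        votes[class] += 1.0;
    }
    votes.iter().map(|v| v / neighbors.len() as f64).collect()
}
```

With the five neighbors listed above (Setosa = 0, Versicolor = 1), the uniform share for Versicolor is 0.60 and the weighted share is about 0.907, matching the 60% and 91% figures.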

Part 5: Probabilistic Predictions

Implementation

let mut knn_proba = KNearestNeighbors::new(5).with_weights(true);
knn_proba.fit(&x_train, &y_train)?;

let probabilities = knn_proba.predict_proba(&x_test)?;
let predictions = knn_proba.predict(&x_test)?;

Results

Sample  Predicted   Setosa  Versicolor  Virginica
─────────────────────────────────────────────────
   0    Setosa      100.0%       0.0%       0.0%
   1    Setosa      100.0%       0.0%       0.0%
   2    Setosa      100.0%       0.0%       0.0%
   3    Versicolor   30.4%      69.6%       0.0%
   4    Versicolor    0.0%     100.0%       0.0%

Interpretation

Samples 0-2 (Setosa):

100% confidence: All 5 nearest neighbors are Setosa
Perfect separation from other species
Small petals (1.4-1.5 cm) characteristic of Setosa

Sample 3 (Versicolor):

69.6% confidence: Some Setosa neighbors nearby
30.4% Setosa: Near species boundary
Medium features create some overlap

Sample 4 (Versicolor):

100% confidence: Clear Versicolor region
All 5 neighbors are Versicolor

Confidence interpretation:

90-100%: High confidence, far from decision boundary
70-90%: Medium confidence, near boundary
50-70%: Low confidence, ambiguous region
<50%: Prediction uncertain, manual review recommended
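
The bands above can be applied mechanically to one row of class probabilities. This is a sketch with a hypothetical `summarize` helper; it assumes each row is a non-empty slice of `f64` values, and assigning the boundary values 0.9, 0.7, and 0.5 to the upper band is a choice made here, not something the table specifies.

```rust
/// Returns (predicted class, confidence, band) for one row of class probabilities.
/// Assumption: `probs` is non-empty and its entries sum to roughly 1.
fn summarize(probs: &[f64]) -> (usize, f64, &'static str) {
    // Argmax: the first class with the highest probability wins.
    let (class, confidence) = probs
        .iter()
        .copied()
        .enumerate()
        .fold((0, f64::NEG_INFINITY), |best, (i, p)| {
            if p > best.1 {
                (i, p)
            } else {
                best
            }
        });

    let band = if confidence >= 0.9 {
        "high"
    } else if confidence >= 0.7 {
        "medium"
    } else if confidence >= 0.5 {
        "low"
    } else {
        "uncertain"
    };
    (class, confidence, band)
}
```

Applied to Sample 3 above (30.4% / 69.6% / 0.0%), this yields class 1 (Versicolor) at 0.696, which falls just inside the low-confidence band.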

Best Configuration

Summary

Best configuration found:
- k = 5 neighbors
- Distance metric: Euclidean
- Voting: Weighted by inverse distance
- Test accuracy: 90.0%

Why This Works

k=5: Large enough to be robust, small enough to capture local patterns
Euclidean: Natural for continuous features
Weighted voting: Leverages proximity information effectively
90% accuracy: 1 misclassification on the 10-sample test set (the same score as uniform k=1 and k=3)

Comparison to Other Classifiers

Classifier	Iris Accuracy	Training Time	Prediction Time
kNN (k=5, weighted)	90%	Instant	O(n·p) per sample
Logistic Regression	90-95%	Fast	Very fast
Decision Tree	85-95%	Medium	Fast
Random Forest	95-100%	Slow	Medium

kNN provides competitive accuracy with zero training time but slower predictions.

Key Insights

1. Small k (1-3)

Risk of overfitting
Sensitive to noise and outliers
Captures fine-grained decision boundaries
Best when data is clean and well-separated

2. Large k (7-9)

Risk of underfitting
Class boundaries blur together
More robust to noise
Best when data is noisy or classes overlap

3. Weighted Voting

Gives more influence to closer neighbors
Critical improvement: 80% → 90% accuracy for k=5
Especially beneficial for larger k values
More intuitive than uniform voting

4. Distance Metric Selection

Euclidean: Best for continuous features (default choice)
Manhattan: More robust to outliers
Minkowski: Generalizes both (p=1 is Manhattan, p=2 is Euclidean; larger p weights the largest differences more)
For Iris: All metrics perform similarly (well-behaved data)

Performance Metrics

Time Complexity

Operation	Iris Dataset (n=20, p=4, k=5)	General
Training (fit)	0.001 ms	O(1) - just stores data
Distance computation	0.02 ms	O(n·p) per sample
Finding k-nearest	0.01 ms	O(n log k) per sample
Voting	<0.001 ms	O(k·c) per sample
Total prediction	~0.03 ms	O(n·p) per sample

Bottleneck: Distance computation dominates (67% of time).
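
The O(n log k) entry for finding the k nearest neighbors corresponds to keeping a bounded max-heap of size k while scanning the distances. The sketch below shows that technique with the standard library's `BinaryHeap`; whether aprender selects neighbors this way is not stated in this chapter, and `Candidate` and `k_nearest` are hypothetical names.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// A (distance, training index) pair ordered by distance, then by index.
struct Candidate {
    dist: f32,
    index: usize,
}

impl Ord for Candidate {
    fn cmp(&self, other: &Self) -> Ordering {
        // total_cmp gives f32 a total order, so the type can live in a heap.
        self.dist
            .total_cmp(&other.dist)
            .then(self.index.cmp(&other.index))
    }
}

impl PartialOrd for Candidate {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl PartialEq for Candidate {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl Eq for Candidate {}

/// Indices of the k smallest distances, nearest first.
fn k_nearest(distances: &[f32], k: usize) -> Vec<usize> {
    let mut heap: BinaryHeap<Candidate> = BinaryHeap::with_capacity(k + 1);
    for (index, &dist) in distances.iter().enumerate() {
        heap.push(Candidate { dist, index });
        if heap.len() > k {
            // The heap is a max-heap, so this drops the current farthest candidate.
            heap.pop();
        }
    }
    // into_sorted_vec returns ascending order, i.e. nearest first.
    heap.into_sorted_vec().into_iter().map(|c| c.index).collect()
}
```

Each push and pop costs O(log k) and there are n of them, giving O(n log k) per query without sorting all n distances.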

Memory Usage

Training storage:

x_train: 20×4×4 = 320 bytes
y_train: 20×8 = 160 bytes
Total: ~480 bytes

Per-sample prediction:

Distance array: 20×4 = 80 bytes
Neighbor buffer: 5×12 = 60 bytes
Total: ~140 bytes per sample

Scalability: kNN requires storing entire training set, making it memory-intensive for large datasets (n > 100,000).
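
The byte counts above follow from element sizes, so they can be recomputed for other dataset sizes. This sketch assumes 4-byte `f32` features and 8-byte labels and indices, which matches the 4 and 8 multipliers in the figures; reading the 12 bytes per neighbor entry as one `f32` distance plus one 8-byte index is an interpretation, not something the figures spell out.

```rust
use std::mem::size_of;

/// Bytes needed to store the training set (f32 features, 8-byte labels).
fn training_bytes(n_samples: usize, n_features: usize) -> usize {
    n_samples * n_features * size_of::<f32>() + n_samples * size_of::<u64>()
}

/// Scratch bytes per prediction: one f32 distance per training sample,
/// plus k neighbor entries of (f32 distance, 8-byte index).
fn prediction_bytes(n_samples: usize, k: usize) -> usize {
    n_samples * size_of::<f32>() + k * (size_of::<f32>() + size_of::<u64>())
}
```

For n=20, p=4, k=5 these return 480 and 140 bytes, the totals listed above.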

Full Code

use aprender::classification::{KNearestNeighbors, DistanceMetric};
use aprender::primitives::Matrix;

// 1. Load data
let (x_train, y_train, x_test, y_test) = load_iris_data()?;

// 2. Basic kNN
let mut knn = KNearestNeighbors::new(3);
knn.fit(&x_train, &y_train)?;
let predictions = knn.predict(&x_test)?;
println!("Accuracy: {:.1}%", compute_accuracy(&predictions, &y_test) * 100.0);

// 3. Hyperparameter tuning
for k in [1, 3, 5, 7, 9] {
    let mut knn = KNearestNeighbors::new(k);
    knn.fit(&x_train, &y_train)?;
    let acc = compute_accuracy(&knn.predict(&x_test)?, &y_test);
    println!("k={}: {:.1}%", k, acc * 100.0);
}

// 4. Best model with weighted voting
let mut knn_best = KNearestNeighbors::new(5)
    .with_weights(true);
knn_best.fit(&x_train, &y_train)?;

// 5. Probabilistic predictions
let probabilities = knn_best.predict_proba(&x_test)?;
for (i, &pred) in knn_best.predict(&x_test)?.iter().enumerate() {
    println!("Sample {}: class={}, confidence={:.1}%",
             i, pred, probabilities[i][pred] * 100.0);
}

Further Exploration

Try different k values:

// Very small k (high variance)
let knn1 = KNearestNeighbors::new(1);  // Perfect training fit

// Very large k (high bias)
let knn15 = KNearestNeighbors::new(15); // 75% of training data

Feature importance analysis:

Remove one feature at a time
Measure impact on accuracy
Identify most discriminative features (likely petal dimensions)
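
A leave-one-feature-out experiment needs a copy of the data with one column removed. The sketch below does this for a row-major flat buffer like the one passed to `Matrix::from_vec` earlier; `drop_column` is a hypothetical helper, not part of aprender.

```rust
/// Remove column `col` from a row-major matrix stored as a flat slice.
/// Assumption: `n_cols > 0`, `col < n_cols`, and `data.len()` is a multiple of `n_cols`.
fn drop_column(data: &[f32], n_cols: usize, col: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(data.len() / n_cols * (n_cols - 1));
    for row in data.chunks(n_cols) {
        for (j, &value) in row.iter().enumerate() {
            if j != col {
                out.push(value);
            }
        }
    }
    out
}
```

The reduced buffer has 3 values per row, so it would be rebuilt into a matrix with 3 columns before refitting and comparing accuracy against the 4-feature baseline.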

Cross-validation:

Split data into 5 folds
Average accuracy across folds
More robust performance estimate than single train/test split
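
The fold bookkeeping can be sketched without any library support. Here `k_fold_indices` and `mean` are hypothetical helpers; samples are assigned to folds round-robin, which is a simplification (no shuffling, no explicit stratification).

```rust
/// Split indices 0..n into `folds` (train, test) pairs using round-robin assignment.
fn k_fold_indices(n: usize, folds: usize) -> Vec<(Vec<usize>, Vec<usize>)> {
    (0..folds)
        .map(|f| {
            // Sample i belongs to the test fold when i % folds == f.
            let test: Vec<usize> = (0..n).filter(|i| i % folds == f).collect();
            let train: Vec<usize> = (0..n).filter(|i| i % folds != f).collect();
            (train, test)
        })
        .collect()
}

/// Arithmetic mean of per-fold accuracies.
fn mean(values: &[f32]) -> f32 {
    values.iter().sum::<f32>() / values.len() as f32
}
```

For each pair, a model is fit on the `train` rows and scored on the `test` rows, and `mean` then averages the per-fold accuracies. With 30 samples and 5 folds, every test fold holds 6 samples.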

Standardization effect:

Compare with/without StandardScaler
Iris features are already on a similar scale (all in cm)
Expect minimal difference, but good practice
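
For reference, standardization rescales each feature column to zero mean and unit variance. The sketch below is a stand-alone version on a row-major flat buffer and is not the library's StandardScaler; it uses the population standard deviation and maps constant columns to 0.0, both of which are choices made here.

```rust
/// Z-score each column of a row-major matrix stored as a flat slice.
/// Assumption: `n_cols > 0` and `data.len()` is a multiple of `n_cols`.
fn standardize(data: &[f32], n_cols: usize) -> Vec<f32> {
    let n_rows = data.len() / n_cols;
    let mut out = data.to_vec();
    for j in 0..n_cols {
        let column: Vec<f32> = (0..n_rows).map(|i| data[i * n_cols + j]).collect();
        let mean = column.iter().sum::<f32>() / n_rows as f32;
        let variance =
            column.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n_rows as f32;
        let std_dev = variance.sqrt();
        for i in 0..n_rows {
            out[i * n_cols + j] = if std_dev > 0.0 {
                (data[i * n_cols + j] - mean) / std_dev
            } else {
                0.0 // constant column: no spread to scale by
            };
        }
    }
    out
}
```

In a real comparison the mean and standard deviation would be computed on the training set only and then applied to the test set, so that no test information leaks into the distances.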

See also:

examples/iris_clustering.rs - K-Means on same dataset
book/src/ml-fundamentals/knn.md - Full kNN theory
examples/logistic-regression.md - Parametric alternative

EXTREME TDD - The Aprender Guide to Zero-Defect Machine Learning