Case Study: KNN Iris
This case study demonstrates K-Nearest Neighbors (kNN) classification on the Iris dataset, exploring the effects of k values, distance metrics, and voting strategies to achieve 90% test accuracy.
Overview
We'll apply kNN to Iris flower data to:
- Classify three species (Setosa, Versicolor, Virginica)
- Explore the effect of k parameter (1, 3, 5, 7, 9)
- Compare distance metrics (Euclidean, Manhattan, Minkowski)
- Analyze weighted vs uniform voting
- Generate probabilistic predictions with confidence scores
Running the Example
cargo run --example knn_iris
Expected output: Comprehensive kNN analysis including accuracy for different k values, distance metric comparison, voting strategy comparison, and probabilistic predictions with confidence scores.
Dataset
Iris Flower Measurements
// Features: [sepal_length, sepal_width, petal_length, petal_width]
// Classes: 0=Setosa, 1=Versicolor, 2=Virginica
// Training set: 20 samples (7 Setosa, 7 Versicolor, 6 Virginica)
let x_train = Matrix::from_vec(20, 4, vec![
    // Setosa (small petals, large sepals)
    5.1, 3.5, 1.4, 0.2,
    4.9, 3.0, 1.4, 0.2,
    ...
    // Versicolor (medium petals and sepals)
    7.0, 3.2, 4.7, 1.4,
    6.4, 3.2, 4.5, 1.5,
    ...
    // Virginica (large petals and sepals)
    6.3, 3.3, 6.0, 2.5,
    5.8, 2.7, 5.1, 1.9,
    ...
])?;
// Test set: 10 samples (3 Setosa, 3 Versicolor, 4 Virginica)
Dataset characteristics:
- 20 training samples (67% of 30-sample dataset)
- 10 test samples (33% of dataset)
- 4 continuous features (all in centimeters)
- 3 species classes (Setosa is well separated; Versicolor and Virginica overlap somewhat)
- Near-balanced class distribution in training set (7/7/6)
Part 1: Basic kNN (k=3)
Implementation
use aprender::classification::KNearestNeighbors;
use aprender::primitives::Matrix;
let mut knn = KNearestNeighbors::new(3);
knn.fit(&x_train, &y_train)?;
let predictions = knn.predict(&x_test)?;
let accuracy = compute_accuracy(&predictions, &y_test);
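The snippet above calls `compute_accuracy`, which is not shown in this section. A minimal sketch of such a helper follows, assuming predictions and labels are slices of `usize` class indices; the helper in the actual example may be written differently.

```rust
/// Fraction of predictions that match the true labels (0.0 to 1.0).
/// Hypothetical stand-in for the `compute_accuracy` helper used above.
fn compute_accuracy(predictions: &[usize], targets: &[usize]) -> f32 {
    assert_eq!(predictions.len(), targets.len(), "length mismatch");
    if targets.is_empty() {
        return 0.0;
    }
    let correct = predictions
        .iter()
        .zip(targets)
        .filter(|(p, t)| p == t)
        .count();
    correct as f32 / targets.len() as f32
}

fn main() {
    // Illustrative labels only (not the example's test set):
    // 10 samples with a single mismatch at index 5.
    let y_test = [0usize, 0, 0, 1, 1, 1, 2, 2, 2, 2];
    let preds = [0usize, 0, 0, 1, 1, 2, 2, 2, 2, 2];
    println!("Accuracy: {:.1}%", compute_accuracy(&preds, &y_test) * 100.0);
}
```

With 10 test samples, each misclassification moves the reported accuracy by 10 points.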
Results
Test Accuracy: 90.0%
Analysis:
- 9 out of 10 test samples correctly classified
- k=3 provides good balance between bias and variance
- Works well even without hyperparameter tuning
Part 2: Effect of k Parameter
Experiment
for k in [1, 3, 5, 7, 9] {
    let mut knn = KNearestNeighbors::new(k);
    knn.fit(&x_train, &y_train)?;
    let predictions = knn.predict(&x_test)?;
    let accuracy = compute_accuracy(&predictions, &y_test);
    println!("k={}: Accuracy = {:.1}%", k, accuracy * 100.0);
}
Results
k=1: Accuracy = 90.0%
k=3: Accuracy = 90.0%
k=5: Accuracy = 80.0%
k=7: Accuracy = 80.0%
k=9: Accuracy = 80.0%
Interpretation
Small k (1-3):
- 90% accuracy: Best performance on this dataset
- k=1 memorizes training data perfectly (lazy learning)
- k=3 balances local patterns with noise reduction
- Risk: Overfitting, sensitive to outliers
Large k (5-9):
- 80% accuracy: Performance degrades
- Decision boundaries become smoother
- More robust to noise but loses fine distinctions
- k=9 uses 45% of training data for each prediction (9/20)
- Risk: Underfitting, class boundaries blur
Optimal k:
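- For this dataset: k=1 and k=3 tie for the best test accuracy (90%); k=3 is the less noise-sensitive of the two
- General rule: k ≈ √n = √20 ≈ 4.5, a rough starting point only (k=5 scores 80% here with uniform voting)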
- Use cross-validation for systematic selection
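To make the outlier sensitivity of small k concrete, here is a from-scratch sketch of uniform-vote kNN. It is not aprender's implementation; `knn_predict`, `toy_train`, and the 2-D points are made up for illustration. The query sits next to a single stray class-2 point inside a class-1 cluster.

```rust
/// Euclidean distance between two equal-length feature vectors.
fn euclidean(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>().sqrt()
}

/// Uniform-vote kNN: returns the majority class among the k nearest
/// training points. Ties go to the lowest class index.
fn knn_predict(train: &[(Vec<f32>, usize)], query: &[f32], k: usize, n_classes: usize) -> usize {
    // Distance from the query to every training point, paired with its label.
    let mut dists: Vec<(f32, usize)> = train
        .iter()
        .map(|(x, label)| (euclidean(x, query), *label))
        .collect();
    dists.sort_by(|a, b| a.0.total_cmp(&b.0));

    // Count one vote per neighbor among the k closest.
    let mut votes = vec![0usize; n_classes];
    for &(_, label) in dists.iter().take(k) {
        votes[label] += 1;
    }
    let mut best = 0;
    for (class, &count) in votes.iter().enumerate() {
        if count > votes[best] {
            best = class;
        }
    }
    best
}

/// Made-up 2-D points (not Iris rows): a class-1 cluster containing one
/// stray class-2 point, plus a distant class-2 cluster.
fn toy_train() -> Vec<(Vec<f32>, usize)> {
    vec![
        (vec![4.5, 1.5], 1),
        (vec![4.7, 1.4], 1),
        (vec![4.4, 1.3], 1),
        (vec![4.6, 1.5], 1),
        (vec![4.55, 1.45], 2), // stray point inside the class-1 cluster
        (vec![6.0, 2.5], 2),
        (vec![5.9, 2.1], 2),
        (vec![6.1, 2.3], 2),
    ]
}

fn main() {
    let train = toy_train();
    let query = [4.56_f32, 1.46];
    for k in [1usize, 3] {
        println!("k={}: class {}", k, knn_predict(&train, &query, k, 3));
    }
}
```

With k=1 the stray point decides the prediction (class 2); with k=3 the two nearest class-1 points outvote it (class 1).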
Part 3: Distance Metrics (k=5)
Comparison
use aprender::classification::{DistanceMetric, KNearestNeighbors};
let mut knn_euclidean = KNearestNeighbors::new(5)
    .with_metric(DistanceMetric::Euclidean);
let mut knn_manhattan = KNearestNeighbors::new(5)
    .with_metric(DistanceMetric::Manhattan);
let mut knn_minkowski = KNearestNeighbors::new(5)
    .with_metric(DistanceMetric::Minkowski(3.0));
Results
Euclidean distance: 80.0%
Manhattan distance: 80.0%
Minkowski (p=3): 80.0%
Interpretation
Identical performance (80%) across all metrics for k=5.
Why:
- Iris features (sepal/petal dimensions) are all continuous and similarly scaled
- All three metrics capture species differences effectively
- Ranking of neighbors is similar across metrics
When metrics differ:
- Euclidean: Best for continuous, normally distributed features
- Manhattan: Better for count data or when outliers present
- Minkowski (p>2): Emphasizes dimensions with largest differences
Recommendation: Use Euclidean (default) for continuous features, Manhattan for robustness to outliers.
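The three metrics can be written out directly. The sketch below uses standalone functions (not aprender's `DistanceMetric` internals, which are not shown here) and compares one Setosa row with one Versicolor row from the training listing.

```rust
/// Manhattan (L1) distance: sum of absolute differences.
fn manhattan(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
}

/// Euclidean (L2) distance: square root of summed squared differences.
fn euclidean(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>().sqrt()
}

/// Minkowski distance of order p (p >= 1).
/// p = 1 reduces to Manhattan and p = 2 to Euclidean.
fn minkowski(a: &[f32], b: &[f32], p: f32) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs().powf(p))
        .sum::<f32>()
        .powf(1.0 / p)
}

fn main() {
    // First Setosa and first Versicolor rows from the training listing.
    let setosa = [5.1_f32, 3.5, 1.4, 0.2];
    let versicolor = [7.0_f32, 3.2, 4.7, 1.4];

    println!("Manhattan:       {:.3}", manhattan(&setosa, &versicolor));
    println!("Euclidean:       {:.3}", euclidean(&setosa, &versicolor));
    println!("Minkowski (p=3): {:.3}", minkowski(&setosa, &versicolor, 3.0));
}
```

As p grows, the largest per-feature difference (petal length here) contributes a larger share of the distance, and the distance value itself shrinks toward that largest difference.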
Part 4: Weighted vs Uniform Voting
Comparison
// Uniform voting: all neighbors count equally
let mut knn_uniform = KNearestNeighbors::new(5);
knn_uniform.fit(&x_train, &y_train)?;
// Weighted voting: closer neighbors count more
let mut knn_weighted = KNearestNeighbors::new(5).with_weights(true);
knn_weighted.fit(&x_train, &y_train)?;
Results
Uniform voting: 80.0%
Weighted voting: 90.0%
Interpretation
Weighted voting improves accuracy by 10 percentage points (from 80% to 90%), which is one additional correct prediction on the 10-sample test set.
Why weighted voting helps:
- Gives more influence to closer (more similar) neighbors
- Reduces impact of distant outliers in k=5 neighborhood
- More intuitive: "very close neighbors matter more"
- Weight formula: w_i = 1 / distance_i
Example scenario:
Neighbor distances for test sample:
Neighbor 1: d=0.2, class=Versicolor, weight=5.0
Neighbor 2: d=0.3, class=Versicolor, weight=3.3
Neighbor 3: d=0.5, class=Versicolor, weight=2.0
Neighbor 4: d=1.8, class=Setosa, weight=0.56
Neighbor 5: d=2.0, class=Setosa, weight=0.50
Uniform: 3 votes Versicolor, 2 votes Setosa → Versicolor (60%)
Weighted: 10.3 weighted votes Versicolor, 1.06 Setosa → Versicolor (91%)
Recommendation: Use weighted voting for k ≥ 5, uniform for k ≤ 3.
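The example scenario above can be reproduced with a few lines of arithmetic. This is a sketch of inverse-distance voting using the stated formula w_i = 1 / distance_i; the function name `vote_shares` and the small epsilon guard against zero distances are assumptions of the sketch, not details taken from aprender.

```rust
/// Guard against division by zero when a neighbor coincides with the query
/// (an assumption of this sketch).
const EPS: f32 = 1e-9;

/// Per-class vote shares from (distance, class) pairs.
/// Weighted: each neighbor votes with 1 / distance. Uniform: each votes 1.
/// Expects at least one neighbor.
fn vote_shares(neighbors: &[(f32, usize)], n_classes: usize, weighted: bool) -> Vec<f32> {
    let mut votes = vec![0.0_f32; n_classes];
    for &(distance, class) in neighbors {
        let w = if weighted { 1.0 / distance.max(EPS) } else { 1.0 };
        votes[class] += w;
    }
    let total: f32 = votes.iter().sum();
    votes.iter().map(|v| v / total).collect()
}

fn main() {
    // Neighbors from the scenario above: 0 = Setosa, 1 = Versicolor.
    let neighbors = [(0.2_f32, 1usize), (0.3, 1), (0.5, 1), (1.8, 0), (2.0, 0)];

    let uniform = vote_shares(&neighbors, 3, false);
    let weighted = vote_shares(&neighbors, 3, true);
    println!("Uniform  Versicolor share: {:.1}%", uniform[1] * 100.0);
    println!("Weighted Versicolor share: {:.1}%", weighted[1] * 100.0);
}
```

The normalized shares are also one natural way to produce per-class probabilities of the kind shown in Part 5.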
Part 5: Probabilistic Predictions
Implementation
let mut knn_proba = KNearestNeighbors::new(5).with_weights(true);
knn_proba.fit(&x_train, &y_train)?;
let probabilities = knn_proba.predict_proba(&x_test)?;
let predictions = knn_proba.predict(&x_test)?;
Results
Sample  Predicted    Setosa  Versicolor  Virginica
──────────────────────────────────────────────────
0       Setosa       100.0%        0.0%       0.0%
1       Setosa       100.0%        0.0%       0.0%
2       Setosa       100.0%        0.0%       0.0%
3       Versicolor    30.4%       69.6%       0.0%
4       Versicolor     0.0%      100.0%       0.0%
Interpretation
Samples 0-2 (Setosa):
- 100% confidence: All 5 nearest neighbors are Setosa
- Perfect separation from other species
- Small petals (1.4-1.5 cm) characteristic of Setosa
Sample 3 (Versicolor):
- 69.6% confidence: Some Setosa neighbors nearby
- 30.4% Setosa: Near species boundary
- Medium features create some overlap
Sample 4 (Versicolor):
- 100% confidence: Clear Versicolor region
- All 5 neighbors are Versicolor
Confidence interpretation:
- 90-100%: High confidence, far from decision boundary
- 70-90%: Medium confidence, near boundary
- 50-70%: Low confidence, ambiguous region
- <50%: Prediction uncertain, manual review recommended
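The confidence bands can be turned into a small triage helper. In this sketch the `Confidence` enum and both function names are hypothetical, and a value exactly on a boundary (90%, 70%, 50%) is assigned to the higher band, which the list above leaves unspecified.

```rust
#[derive(Debug, PartialEq)]
enum Confidence {
    High,      // 90-100%
    Medium,    // 70-90%
    Low,       // 50-70%
    Uncertain, // below 50%
}

/// Maps the winning class probability (0.0..=1.0) to a confidence band.
fn confidence_band(p: f32) -> Confidence {
    if p >= 0.9 {
        Confidence::High
    } else if p >= 0.7 {
        Confidence::Medium
    } else if p >= 0.5 {
        Confidence::Low
    } else {
        Confidence::Uncertain
    }
}

/// Index and probability of the most likely class, or None for an empty row.
fn top_class(probs: &[f32]) -> Option<(usize, f32)> {
    probs
        .iter()
        .copied()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(&b.1))
}

fn main() {
    // Rows 0 and 3 from the results table, as fractions.
    let rows = [[1.0_f32, 0.0, 0.0], [0.304, 0.696, 0.0]];
    for row in &rows {
        if let Some((class, p)) = top_class(row) {
            println!("class={} p={:.3} band={:?}", class, p, confidence_band(p));
        }
    }
}
```

Under these bands, Sample 3 (69.6% Versicolor) falls just inside the low-confidence range.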
Best Configuration
Summary
Best configuration found:
- k = 5 neighbors
- Distance metric: Euclidean
- Voting: Weighted by inverse distance
- Test accuracy: 90.0% (the same score as k=1 and k=3 with uniform voting)
Why This Works
- k=5: Large enough to be robust, small enough to capture local patterns
- Euclidean: Natural for continuous features
- Weighted voting: Leverages proximity information effectively
- 90% accuracy: 1 misclassification on the 10-sample test set; with so few samples, each prediction shifts accuracy by 10 points
Comparison to Other Classifiers
| Classifier | Iris Accuracy | Training Time | Prediction Time |
|---|---|---|---|
| kNN (k=5, weighted) | 90% | Instant | O(n·p) per sample |
| Logistic Regression | 90-95% | Fast | Very fast |
| Decision Tree | 85-95% | Medium | Fast |
| Random Forest | 95-100% | Slow | Medium |
kNN provides competitive accuracy with zero training time but slower predictions.
Key Insights
1. Small k (1-3)
- Risk of overfitting
- Sensitive to noise and outliers
- Captures fine-grained decision boundaries
- Best when data is clean and well-separated
2. Large k (7-9)
- Risk of underfitting
- Class boundaries blur together
- More robust to noise
- Best when data is noisy or classes overlap
3. Weighted Voting
- Gives more influence to closer neighbors
- Critical improvement: 80% → 90% accuracy for k=5
- Especially beneficial for larger k values
- More intuitive than uniform voting
4. Distance Metric Selection
- Euclidean: Best for continuous features (default choice)
- Manhattan: More robust to outliers
- Minkowski: Generalizes both (p=1 is Manhattan, p=2 is Euclidean)
- For Iris: All metrics perform similarly (well-behaved data)
Performance Metrics
Time Complexity
| Operation | Time on Iris (n=20, p=4, k=5) | Complexity |
|---|---|---|
| Training (fit) | 0.001 ms | O(1) - just stores data |
| Distance computation | 0.02 ms | O(n·p) per sample |
| Finding k-nearest | 0.01 ms | O(n log k) per sample |
| Voting | <0.001 ms | O(k·c) per sample |
| Total prediction | ~0.03 ms | O(n·p) per sample |
Bottleneck: Distance computation dominates (67% of time).
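The O(n log k) cost for finding the k nearest neighbors comes from keeping only k candidates in a max-heap instead of sorting all n distances. The following sketch shows one way to do that with the standard library; it is not taken from aprender, and the function name `k_nearest` is hypothetical.

```rust
use std::collections::BinaryHeap;

/// Indices and distances of the k smallest entries in `dists`, nearest first.
/// A max-heap capped at k entries gives O(n log k) instead of O(n log n).
/// Assumes distances are non-negative and not NaN, so that their IEEE 754
/// bit patterns sort in the same order as their values.
fn k_nearest(dists: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut heap: BinaryHeap<(u32, usize)> = BinaryHeap::with_capacity(k + 1);
    for (i, &d) in dists.iter().enumerate() {
        heap.push((d.to_bits(), i));
        if heap.len() > k {
            heap.pop(); // discard the current farthest candidate
        }
    }
    // into_sorted_vec is ascending, so the nearest neighbor comes first.
    heap.into_sorted_vec()
        .into_iter()
        .map(|(bits, i)| (i, f32::from_bits(bits)))
        .collect()
}

fn main() {
    // Illustrative distances from one query to five training points.
    let dists = [0.9_f32, 0.2, 1.5, 0.3, 0.5];
    for (index, d) in k_nearest(&dists, 3) {
        println!("training index {} at distance {}", index, d);
    }
}
```

For n=20 the difference from a full sort is negligible; the heap pays off as n grows while k stays small.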
Memory Usage
Training storage:
- x_train: 20×4×4 = 320 bytes
- y_train: 20×8 = 160 bytes
- Total: ~480 bytes
Per-sample prediction:
- Distance array: 20×4 = 80 bytes
- Neighbor buffer: 5×12 = 60 bytes
- Total: ~140 bytes per sample
Scalability: kNN requires storing entire training set, making it memory-intensive for large datasets (n > 100,000).
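The storage arithmetic above scales linearly in both n and p. The sketch below assumes 4-byte `f32` features and 8-byte labels, which is what the 20×4×4 and 20×8 figures imply; `training_bytes` is a hypothetical helper.

```rust
use std::mem::size_of;

/// Bytes needed to store n samples with p `f32` features each,
/// plus one 8-byte label per sample.
fn training_bytes(n: usize, p: usize) -> usize {
    n * p * size_of::<f32>() + n * size_of::<u64>()
}

fn main() {
    println!("n=20, p=4:         {} bytes", training_bytes(20, 4));
    println!("n=100000, p=4:     {} bytes", training_bytes(100_000, 4));
    println!("n=100000, p=1000:  {} bytes", training_bytes(100_000, 1_000));
}
```

At p=4 even 100,000 samples need only about 2.4 MB; the footprint becomes significant when both n and p are large.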
Full Code
use aprender::classification::{KNearestNeighbors, DistanceMetric};
use aprender::primitives::Matrix;
// 1. Load data
let (x_train, y_train, x_test, y_test) = load_iris_data()?;
// 2. Basic kNN
let mut knn = KNearestNeighbors::new(3);
knn.fit(&x_train, &y_train)?;
let predictions = knn.predict(&x_test)?;
println!("Accuracy: {:.1}%", compute_accuracy(&predictions, &y_test) * 100.0);
// 3. Hyperparameter tuning
for k in [1, 3, 5, 7, 9] {
    let mut knn = KNearestNeighbors::new(k);
    knn.fit(&x_train, &y_train)?;
    let acc = compute_accuracy(&knn.predict(&x_test)?, &y_test);
    println!("k={}: {:.1}%", k, acc * 100.0);
}
// 4. Best model with weighted voting
let mut knn_best = KNearestNeighbors::new(5)
    .with_weights(true);
knn_best.fit(&x_train, &y_train)?;
// 5. Probabilistic predictions
let probabilities = knn_best.predict_proba(&x_test)?;
for (i, &pred) in knn_best.predict(&x_test)?.iter().enumerate() {
    println!("Sample {}: class={}, confidence={:.1}%",
        i, pred, probabilities[i][pred] * 100.0);
}
Further Exploration
Try different k values:
// Very small k (high variance)
let knn1 = KNearestNeighbors::new(1); // Perfect training fit
// Very large k (high bias)
let knn15 = KNearestNeighbors::new(15); // 75% of training data
Feature importance analysis:
- Remove one feature at a time
- Measure impact on accuracy
- Identify most discriminative features (likely petal dimensions)
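Removing one feature at a time amounts to dropping a column from the row-major feature buffer before refitting. A minimal sketch, with a hypothetical `drop_column` helper operating on a plain `f32` slice:

```rust
/// Returns a copy of row-major `data` (with `n_cols` values per row)
/// that omits column `skip`.
fn drop_column(data: &[f32], n_cols: usize, skip: usize) -> Vec<f32> {
    assert!(n_cols > 0 && skip < n_cols, "column index out of range");
    assert!(data.len() % n_cols == 0, "data is not a whole number of rows");
    let mut out = Vec::with_capacity(data.len() / n_cols * (n_cols - 1));
    for row in data.chunks(n_cols) {
        for (j, &value) in row.iter().enumerate() {
            if j != skip {
                out.push(value);
            }
        }
    }
    out
}

fn main() {
    // First two Setosa rows from the training listing.
    let rows = [5.1_f32, 3.5, 1.4, 0.2, 4.9, 3.0, 1.4, 0.2];
    // Drop sepal_width (column 1), leaving 3 features per row.
    println!("{:?}", drop_column(&rows, 4, 1));
}
```

The reduced buffer can then be wrapped with `Matrix::from_vec(rows, 3, ...)` as in the dataset listing, and the model refit to compare accuracy against the 4-feature baseline.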
Cross-validation:
- Split data into 5 folds
- Average accuracy across folds
- More robust performance estimate than single train/test split
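The fold split itself is a small index computation. The sketch below produces contiguous folds with a hypothetical `k_fold_indices` helper. Because the Iris rows in this example are grouped by class, the indices should be shuffled or stratified before applying it; otherwise some folds would contain a single species.

```rust
/// Splits indices 0..n into `folds` (train, test) pairs with contiguous,
/// near-equal test ranges. No shuffling is performed.
fn k_fold_indices(n: usize, folds: usize) -> Vec<(Vec<usize>, Vec<usize>)> {
    assert!(folds >= 2 && folds <= n, "need 2 <= folds <= n");
    (0..folds)
        .map(|f| {
            let start = f * n / folds;
            let end = (f + 1) * n / folds;
            let test: Vec<usize> = (start..end).collect();
            let train: Vec<usize> = (0..start).chain(end..n).collect();
            (train, test)
        })
        .collect()
}

fn main() {
    // 30 samples and 5 folds: each fold tests on 6 samples, trains on 24.
    for (f, (train, test)) in k_fold_indices(30, 5).iter().enumerate() {
        println!("fold {}: {} train, test indices {:?}", f, train.len(), test);
    }
}
```

Each candidate k would be fit on the train indices and scored on the test indices of every fold, and the per-fold accuracies averaged.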
Standardization effect:
- Compare with/without StandardScaler
- Iris features are already similar scale (all in cm)
- Expect minimal difference, but good practice
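For reference, standardization rescales each feature to zero mean and unit variance so that no single feature dominates the distance. This sketch uses hand-written helpers on row-major `f32` slices rather than the library's StandardScaler, whose API is not shown here; `column_stats` and `standardize` are hypothetical names.

```rust
/// Per-column (mean, population standard deviation) of row-major `data`.
fn column_stats(data: &[f32], n_cols: usize) -> Vec<(f32, f32)> {
    assert!(n_cols > 0 && !data.is_empty() && data.len() % n_cols == 0);
    let n_rows = (data.len() / n_cols) as f32;
    (0..n_cols)
        .map(|j| {
            // Iterator over column j of the row-major buffer.
            let col = || data.iter().skip(j).step_by(n_cols);
            let mean = col().sum::<f32>() / n_rows;
            let var = col().map(|v| (v - mean).powi(2)).sum::<f32>() / n_rows;
            (mean, var.sqrt())
        })
        .collect()
}

/// Applies z = (x - mean) / std column by column.
/// Zero-variance columns map to 0.
fn standardize(data: &[f32], n_cols: usize, stats: &[(f32, f32)]) -> Vec<f32> {
    data.iter()
        .enumerate()
        .map(|(i, &x)| {
            let (mean, std) = stats[i % n_cols];
            if std > 0.0 { (x - mean) / std } else { 0.0 }
        })
        .collect()
}

fn main() {
    // Two rows, two columns on very different scales (made-up values).
    let train = [1.0_f32, 10.0, 3.0, 30.0];
    let stats = column_stats(&train, 2);
    println!("{:?}", standardize(&train, 2, &stats));
}
```

The statistics should be computed on the training set only and then reused to transform the test set, so that no test information leaks into the model.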
Related Examples
- examples/iris_clustering.rs: K-Means on same dataset
- book/src/ml-fundamentals/knn.md: Full kNN theory
- examples/logistic-regression.md: Parametric alternative