Classification Metrics Theory

Chapter Status: ✅ 100% Working (All metrics verified)

Status             | Count | Examples
-------------------|-------|------------------------------------
✅ Working          | 4+    | All verified in src/metrics/mod.rs
⏳ In Progress      | 0     | -
⬜ Not Implemented  | 0     | -

Last tested: 2025-11-19 | Aprender version: 0.3.0 | Test file: src/metrics/mod.rs (tests module)


Overview

Classification metrics evaluate how well a model predicts discrete classes. Unlike regression, we're not measuring "how far off"—we're measuring "right or wrong."

Key Metrics:

  • Accuracy: Fraction of correct predictions
  • Precision: Of predicted positives, how many are correct?
  • Recall: Of actual positives, how many did we find?
  • F1 Score: Harmonic mean of precision and recall

Why This Matters: Accuracy alone can be misleading. A spam filter that marks every email as "not spam" can still score 99% accuracy on a mostly-legitimate inbox while catching zero spam. Precision and recall are needed to see the full picture.


Mathematical Foundation

The Confusion Matrix

All classification metrics derive from the confusion matrix:

                Predicted
                Pos    Neg
Actual  Pos    TP     FN
        Neg    FP     TN

TP = True Positives  (correctly predicted positive)
TN = True Negatives  (correctly predicted negative)
FP = False Positives (incorrectly predicted positive - Type I error)
FN = False Negatives (incorrectly predicted negative - Type II error)
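
For concreteness, here is a minimal sketch (plain Rust, independent of aprender's internals, which may differ) that tallies the four counts from binary label slices, treating values above 0.5 as the positive class:

// Sketch only: tally confusion-matrix cells from binary labels.
// Values > 0.5 are treated as the positive class.
fn confusion_counts(y_true: &[f64], y_pred: &[f64]) -> (usize, usize, usize, usize) {
    let (mut tp, mut tn, mut fp, mut fn_) = (0, 0, 0, 0);
    for (&t, &p) in y_true.iter().zip(y_pred.iter()) {
        match (t > 0.5, p > 0.5) {
            (true, true) => tp += 1,   // true positive
            (false, false) => tn += 1, // true negative
            (false, true) => fp += 1,  // false positive (Type I error)
            (true, false) => fn_ += 1, // false negative (Type II error)
        }
    }
    (tp, tn, fp, fn_)
}

Every metric in this chapter is a ratio built from these four counts.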

Accuracy

Definition:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
         = Correct / Total

Range: [0, 1], higher is better

Weakness: Misleading with imbalanced classes

Example:

Dataset: 95% negative, 5% positive
Model: Always predict negative
Accuracy = 95% (looks good!)
But: Model is useless (finds zero positives)

Precision

Definition:

Precision = TP / (TP + FP)
          = True Positives / All Predicted Positives

Interpretation: "Of all items I labeled positive, what fraction are actually positive?"

Use Case: When false positives are costly

  • Spam filter marking important email as spam
  • Medical diagnosis triggering unnecessary treatment

Recall (Sensitivity, True Positive Rate)

Definition:

Recall = TP / (TP + FN)
       = True Positives / All Actual Positives

Interpretation: "Of all actual positives, what fraction did I find?"

Use Case: When false negatives are costly

  • Cancer screening missing actual cases
  • Fraud detection missing actual fraud

F1 Score

Definition:

F1 = 2 * (Precision * Recall) / (Precision + Recall)
   = Harmonic mean of Precision and Recall

Why harmonic mean? Punishes extreme imbalance. If either precision or recall is very low, F1 is low.

Example:

  • Precision = 1.0, Recall = 0.01 → Arithmetic mean = 0.505 (misleading)
  • F1 = 2 * (1.0 * 0.01) / (1.0 + 0.01) = 0.02 (realistic)
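
As a quick check of the arithmetic above, a small sketch (plain Rust, not the aprender API) comparing the harmonic and arithmetic means:

// Sketch: F1 as the harmonic mean of precision and recall.
fn f1(precision: f64, recall: f64) -> f64 {
    if precision + recall == 0.0 {
        return 0.0; // convention: F1 = 0 when precision and recall are both 0
    }
    2.0 * precision * recall / (precision + recall)
}

let (p, r) = (1.0, 0.01);
println!("Arithmetic mean: {:.3}", (p + r) / 2.0); // 0.505 (misleading)
println!("F1:              {:.3}", f1(p, r));      // 0.020 (realistic)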

Implementation in Aprender

Example: Binary Classification Metrics

use aprender::metrics::{accuracy, precision, recall, f1_score};
use aprender::primitives::Vector;

let y_true = Vector::from_vec(vec![1.0, 0.0, 1.0, 1.0, 0.0, 1.0]);
let y_pred = Vector::from_vec(vec![1.0, 0.0, 0.0, 1.0, 0.0, 1.0]);
//                                  TP   TN   FN   TP   TN   TP
// Confusion Matrix:
// TP = 3, TN = 2, FP = 0, FN = 1

// Accuracy: (3+2)/(3+2+0+1) = 5/6 = 0.833
let acc = accuracy(&y_true, &y_pred);
println!("Accuracy: {:.3}", acc); // 0.833

// Precision: 3/(3+0) = 1.0 (no false positives)
let prec = precision(&y_true, &y_pred);
println!("Precision: {:.3}", prec); // 1.000

// Recall: 3/(3+1) = 0.75 (one false negative)
let rec = recall(&y_true, &y_pred);
println!("Recall: {:.3}", rec); // 0.750

// F1: 2*(1.0*0.75)/(1.0+0.75) = 0.857
let f1 = f1_score(&y_true, &y_pred);
println!("F1: {:.3}", f1); // 0.857

Test References:

  • src/metrics/mod.rs::tests::test_accuracy
  • src/metrics/mod.rs::tests::test_precision
  • src/metrics/mod.rs::tests::test_recall
  • src/metrics/mod.rs::tests::test_f1_score

Choosing the Right Metric

Decision Guide

Are classes balanced (roughly 50/50)?
├─ YES → Accuracy is reasonable
└─ NO → Use Precision/Recall/F1

Which error is more costly?
├─ False Positives worse → Maximize Precision
├─ False Negatives worse → Maximize Recall
└─ Both equally bad → Maximize F1

Examples:
- Email spam (FP bad): High Precision
- Cancer screening (FN bad): High Recall
- General classification: F1 Score

Metric Comparison

Metric    | Formula       | Range | Best For         | Weakness
----------|---------------|-------|------------------|-----------------------
Accuracy  | (TP+TN)/Total | [0,1] | Balanced classes | Imbalanced data
Precision | TP/(TP+FP)    | [0,1] | Minimizing FP    | Ignores FN
Recall    | TP/(TP+FN)    | [0,1] | Minimizing FN    | Ignores FP
F1        | 2PR/(P+R)     | [0,1] | Balancing P & R  | Equal weight to P & R

Precision-Recall Trade-off

Key Insight: You can't maximize both precision and recall simultaneously (except with a perfect classifier); improving one typically costs the other.

Example: Spam Filter Threshold

Threshold | Precision | Recall | F1
----------|-----------|--------|----
  0.9     |   0.95    |  0.60  | 0.74  (conservative)
  0.5     |   0.80    |  0.85  | 0.82  (balanced)
  0.1     |   0.50    |  0.98  | 0.66  (aggressive)

Choosing threshold:

  • High threshold → High precision, low recall (few predictions, mostly correct)
  • Low threshold → Low precision, high recall (many predictions, some wrong)
  • Middle ground → Maximize F1
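
A hedged sketch of how such a table is produced: sweep a decision threshold over predicted probabilities and recompute precision and recall at each cutoff. This is plain Rust over score/label slices; how you obtain scores from a model is library-specific.

// Sketch: precision and recall at a given decision threshold.
fn precision_recall_at(scores: &[f64], y_true: &[f64], threshold: f64) -> (f64, f64) {
    let (mut tp, mut fp, mut fn_) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (&score, &label) in scores.iter().zip(y_true.iter()) {
        match (score >= threshold, label > 0.5) {
            (true, true) => tp += 1.0,   // predicted positive, actually positive
            (true, false) => fp += 1.0,  // predicted positive, actually negative
            (false, true) => fn_ += 1.0, // missed an actual positive
            (false, false) => {}         // true negative: not needed for P/R
        }
    }
    let precision = if tp + fp > 0.0 { tp / (tp + fp) } else { 0.0 };
    let recall = if tp + fn_ > 0.0 { tp / (tp + fn_) } else { 0.0 };
    (precision, recall)
}

// Sweep thresholds to build a table like the one above:
// for t in [0.9, 0.5, 0.1] {
//     let (p, r) = precision_recall_at(&scores, &y_true, t);
//     println!("{t:.1} | {p:.2} | {r:.2}");
// }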

Practical Considerations

Imbalanced Classes

Problem: 1% positive class (fraud detection, rare disease)

Bad Baseline:

// Always predict negative
// Accuracy = 99% (misleading!)
// Recall = 0% (finds no positives - useless)

Solution: Use precision, recall, F1 instead of accuracy
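
To make this concrete, a short sketch reusing the accuracy and recall functions shown earlier in this chapter (the 90/10 class split here is illustrative):

use aprender::metrics::{accuracy, recall};
use aprender::primitives::Vector;

// 10 samples, only 1 positive; the baseline predicts "negative" for everything.
let y_true = Vector::from_vec(vec![0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]);
let y_pred = Vector::from_vec(vec![0.0; 10]);

println!("Accuracy: {:.2}", accuracy(&y_true, &y_pred)); // 0.90 (looks fine)
println!("Recall:   {:.2}", recall(&y_true, &y_pred));   // 0.00 (finds no positives)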

Multi-class Classification

For multi-class, compute metrics per class then average:

  • Macro-average: Average across classes (each class weighted equally)
  • Micro-average: Pool TP/FP/FN across all classes, then compute the metric once from the pooled counts

Example (3 classes):

Class A: Precision = 0.9
Class B: Precision = 0.8
Class C: Precision = 0.5

Macro-avg Precision = (0.9 + 0.8 + 0.5) / 3 = 0.73
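
A minimal sketch of the two averaging schemes over per-class counts (plain Rust; the counts are illustrative and chosen to match the precisions above):

// Illustrative per-class (TP, FP) counts.
let per_class: [(f64, f64); 3] = [
    (9.0, 1.0), // Class A: precision 0.9
    (8.0, 2.0), // Class B: precision 0.8
    (5.0, 5.0), // Class C: precision 0.5
];

// Macro: average the per-class precisions (each class weighted equally).
let macro_p: f64 = per_class.iter().map(|&(tp, fp)| tp / (tp + fp)).sum::<f64>() / 3.0;

// Micro: pool the counts first, then compute a single precision.
let (tp_sum, fp_sum) = per_class
    .iter()
    .fold((0.0, 0.0), |(a, b), &(tp, fp)| (a + tp, b + fp));
let micro_p = tp_sum / (tp_sum + fp_sum);

println!("Macro precision: {:.2}", macro_p); // 0.73
println!("Micro precision: {:.2}", micro_p); // 0.73 here; the two differ when classes are imbalanced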

Verification Through Tests

Classification metrics have comprehensive test coverage:

Property Tests:

  1. Perfect predictions → All metrics = 1.0
  2. All wrong predictions → All metrics = 0.0
  3. Metrics are in [0, 1] range
  4. min(Precision, Recall) ≤ F1 ≤ max(Precision, Recall)

Test Reference: src/metrics/mod.rs validates these properties
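
As a sketch of what such checks look like as plain assertions (the actual tests in src/metrics/mod.rs may be structured differently, e.g. as property-based tests over random inputs):

use aprender::metrics::{accuracy, f1_score, precision, recall};
use aprender::primitives::Vector;

#[test]
fn perfect_predictions_score_one_on_every_metric() {
    let y = Vector::from_vec(vec![1.0, 0.0, 1.0, 0.0]);
    assert_eq!(accuracy(&y, &y), 1.0);
    assert_eq!(precision(&y, &y), 1.0);
    assert_eq!(recall(&y, &y), 1.0);
    assert_eq!(f1_score(&y, &y), 1.0);
}

#[test]
fn f1_lies_between_precision_and_recall() {
    let y_true = Vector::from_vec(vec![1.0, 0.0, 1.0, 1.0, 0.0, 1.0]);
    let y_pred = Vector::from_vec(vec![1.0, 0.0, 0.0, 1.0, 0.0, 1.0]);
    let p = precision(&y_true, &y_pred);
    let r = recall(&y_true, &y_pred);
    let f1 = f1_score(&y_true, &y_pred);
    assert!(p.min(r) - 1e-9 <= f1 && f1 <= p.max(r) + 1e-9);
}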


Real-World Application

Evaluating Logistic Regression

use aprender::classification::LogisticRegression;
use aprender::metrics::{accuracy, precision, recall, f1_score};
use aprender::traits::Classifier;
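// (Assumes x_train, y_train, x_test, y_test have already been prepared, e.g. via a train/test split.)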

// Train model
let mut model = LogisticRegression::new();
model.fit(&x_train, &y_train).unwrap();

// Predict on test set
let y_pred = model.predict(&x_test);

// Evaluate with multiple metrics
let acc = accuracy(&y_test, &y_pred);
let prec = precision(&y_test, &y_pred);
let rec = recall(&y_test, &y_pred);
let f1 = f1_score(&y_test, &y_pred);

println!("Accuracy:  {:.3}", acc);   // e.g., 0.892
println!("Precision: {:.3}", prec);  // e.g., 0.875
println!("Recall:    {:.3}", rec);   // e.g., 0.910
println!("F1 Score:  {:.3}", f1);    // e.g., 0.892

// Decision: F1 > 0.85 → Accept model

Case Study: Logistic Regression uses these metrics


Further Reading

Peer-Reviewed Paper

Powers (2011) - Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation

  • Relevance: Comprehensive survey of classification metrics
  • Link: arXiv (publicly accessible)
  • Key Contribution: Unifies many metrics under single framework
  • Advanced Topics: ROC curves, AUC, informedness
  • Applied in: src/metrics/mod.rs

Summary

What You Learned:

  • ✅ Confusion matrix: TP, TN, FP, FN
  • ✅ Accuracy: Simple but misleading with imbalance
  • ✅ Precision: Minimizes false positives
  • ✅ Recall: Minimizes false negatives
  • ✅ F1: Balances precision and recall
  • ✅ Choose metric based on: class balance, cost of errors

Verification Guarantee: All classification metrics extensively tested (10+ tests) in src/metrics/mod.rs. Property tests verify mathematical properties.

Quick Reference:

  • Balanced classes: Accuracy
  • Imbalanced classes: Precision/Recall/F1
  • FP costly: Precision
  • FN costly: Recall
  • Balance both: F1

Next Chapter: Cross-Validation Theory

Previous Chapter: Logistic Regression Theory