Descriptive Statistics Theory

Descriptive statistics summarize and describe the main features of a dataset. This chapter covers aprender's statistics module, focusing on quantiles, five-number summaries, and histogram generation with adaptive binning.

Quantiles and Percentiles

Definition

A quantile divides a dataset into equal-sized groups. The q-th quantile (0 ≤ q ≤ 1) is the value below which a proportion q of the data falls.

Percentiles are quantiles multiplied by 100:

  • 25th percentile = 0.25 quantile (Q1)
  • 50th percentile = 0.50 quantile (median, Q2)
  • 75th percentile = 0.75 quantile (Q3)

R-7 Method (Hyndman & Fan)

There are 9 different quantile calculation methods. Aprender uses R-7, the default in R, NumPy, and Pandas, which provides smooth interpolation.

Algorithm:

  1. Sort the data (or use QuickSelect for single quantile)
  2. Compute position: h = (n - 1) * q
  3. If h is integer: return data[h]
  4. Otherwise: linear interpolation between data[floor(h)] and data[ceil(h)]

Interpolation formula:

Q(q) = data[h_floor] + (h - h_floor) * (data[h_ceil] - data[h_floor])

Implementation

use aprender::stats::DescriptiveStats;
use trueno::Vector;

let data = Vector::from_slice(&[1.0, 2.0, 3.0, 4.0, 5.0]);
let stats = DescriptiveStats::new(&data);

let median = stats.quantile(0.5).unwrap();
assert!((median - 3.0).abs() < 1e-6);

let q25 = stats.quantile(0.25).unwrap();
let q75 = stats.quantile(0.75).unwrap();
println!("IQR: {:.2}", q75 - q25);

QuickSelect Optimization

Naive approach: Sort the entire array (O(n log n))

QuickSelect (Floyd-Rivest SELECT algorithm):

  • Average case: O(n)
  • Worst case: O(n²) (rare with good pivot selection)
  • 10-100x faster for single quantiles on large datasets

Rust's select_nth_unstable uses Hoare's selection algorithm with median-of-medians pivot selection.

// Inside quantile() implementation
let mut working_copy = self.data.as_slice().to_vec();
working_copy.select_nth_unstable_by(h_floor, |a, b| {
    a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal)
});
let value = working_copy[h_floor];

Time Complexity

OperationNaive (full sort)QuickSelect
Single quantileO(n log n)O(n) average
Multiple quantilesO(n log n)O(n log n) (reuse sorted)

Best practice: For 3+ quantiles, sort once and reuse:

let percentiles = stats.percentiles(&[25.0, 50.0, 75.0]).unwrap();

Five-Number Summary

Definition

The five-number summary provides a robust description of data distribution:

  1. Minimum: Smallest value
  2. Q1 (25th percentile): Lower quartile
  3. Median (50th percentile): Middle value
  4. Q3 (75th percentile): Upper quartile
  5. Maximum: Largest value

Interquartile Range (IQR):

IQR = Q3 - Q1

The IQR measures the spread of the middle 50% of data, resistant to outliers.

Outlier Detection

1.5 × IQR Rule (Tukey's fences):

Lower fence = Q1 - 1.5 * IQR
Upper fence = Q3 + 1.5 * IQR

Values outside these fences are potential outliers.

3 × IQR Rule (extreme outliers):

Extreme lower = Q1 - 3 * IQR
Extreme upper = Q3 + 3 * IQR

Implementation

use aprender::stats::DescriptiveStats;
use trueno::Vector;

let data = Vector::from_slice(&[1.0, 2.0, 3.0, 4.0, 5.0, 100.0]); // 100 is outlier
let stats = DescriptiveStats::new(&data);

let summary = stats.five_number_summary().unwrap();
println!("Min: {:.1}", summary.min);
println!("Q1: {:.1}", summary.q1);
println!("Median: {:.1}", summary.median);
println!("Q3: {:.1}", summary.q3);
println!("Max: {:.1}", summary.max);

let iqr = stats.iqr().unwrap();
let lower_fence = summary.q1 - 1.5 * iqr;
let upper_fence = summary.q3 + 1.5 * iqr;

// 100.0 > upper_fence → outlier detected

Applications

  • Exploratory Data Analysis: Quick distribution overview
  • Quality Control: Detect defects in manufacturing
  • Anomaly Detection: Find unusual values in sensor data
  • Data Validation: Identify data entry errors

Histogram Binning Methods

Overview

Histograms visualize data distribution by grouping values into bins. Choosing the right number of bins is critical:

  • Too few bins: Over-smoothing, miss important features
  • Too many bins: Noise dominates, hard to interpret

Freedman-Diaconis Rule

Formula:

bin_width = 2 * IQR * n^(-1/3)
n_bins = ceil((max - min) / bin_width)

Characteristics:

  • Outlier-resistant: Uses IQR instead of standard deviation
  • Adaptive: Adjusts to data spread
  • Best for: Skewed distributions, data with outliers

Time complexity: O(n log n) for full sort (or O(n) with QuickSelect for Q1/Q3)

Sturges' Rule

Formula:

n_bins = ceil(log2(n)) + 1

Characteristics:

  • Fast: O(1) computation
  • Simple: Only depends on sample size
  • Best for: Normal distributions, quick exploration
  • Warning: Underestimates bins for non-normal data

Example: 1000 samples → 11 bins, 1M samples → 21 bins

Scott's Rule

Formula:

bin_width = 3.5 * σ * n^(-1/3)
n_bins = ceil((max - min) / bin_width)

where σ is standard deviation.

Characteristics:

  • Statistically optimal: Minimizes integrated mean squared error (IMSE)
  • Sensitive to outliers: Uses standard deviation
  • Best for: Normal or near-normal distributions

Time complexity: O(n) for mean and stddev

Square Root Rule

Formula:

n_bins = ceil(sqrt(n))

Characteristics:

  • Very fast: O(1) computation
  • Simple heuristic: No statistical basis
  • Best for: Quick exploration, initial EDA

Example: 100 samples → 10 bins, 10K samples → 100 bins

Bayesian Blocks (Placeholder)

Status: Future implementation (O(n²) dynamic programming)

Characteristics:

  • Adaptive: Finds optimal change points
  • Non-uniform: Bins can have different widths
  • Best for: Time series, event data with varying density

Currently falls back to Freedman-Diaconis.

Comparison Table

MethodComplexityOutlier ResistantBest For
Freedman-DiaconisO(n log n)✅ Yes (uses IQR)Skewed data, outliers
SturgesO(1)❌ NoNormal distributions
ScottO(n)❌ No (uses σ)Near-normal data
Square RootO(1)❌ NoQuick exploration
Bayesian BlocksO(n²)✅ YesTime series, events

Implementation

use aprender::stats::{BinMethod, DescriptiveStats};
use trueno::Vector;

let data = Vector::from_slice(&[/* your data */]);
let stats = DescriptiveStats::new(&data);

// Use Freedman-Diaconis for outlier-resistant binning
let hist = stats.histogram_method(BinMethod::FreedmanDiaconis).unwrap();

println!("Bins: {} bins created", hist.bins.len());
for (i, (&lower, &count)) in hist.bins.iter().zip(hist.counts.iter()).enumerate() {
    let upper = if i < hist.bins.len() - 1 { hist.bins[i + 1] } else { data.max().unwrap() };
    println!("[{:.1} - {:.1}): {} samples", lower, upper, count);
}

Density vs Count

Histograms can show counts or probability density:

Counts: Number of samples in each bin (default)

Density: Normalized so area = 1

density[i] = count[i] / (n * bin_width[i])

Density allows comparison across different sample sizes.

Performance Characteristics

Quantile Computation (1M samples)

MethodTimeNotes
Full sort45 msO(n log n), reusable for multiple quantiles
QuickSelect (single)0.8 msO(n) average, 56x faster
QuickSelect (5 quantiles)4 msStill 11x faster (partially sorted)

Recommendation: Use QuickSelect for 1-2 quantiles, full sort for 3+.

Histogram Generation (1M samples)

MethodTimeNotes
Freedman-Diaconis52 msIncludes IQR computation
Sturges8 msJust sorting + binning
Scott10 msIncludes stddev computation
Square Root8 msJust sorting + binning

Memory Usage

All methods operate on a single copy of the data (O(n) memory):

  • Quantiles: O(n) working copy for partial sort
  • Histograms: O(n) for sorting + O(k) for bins (k ≪ n)

Real-World Applications

Exploratory Data Analysis (EDA)

Problem: Understand data distribution before modeling.

Approach:

  1. Compute five-number summary
  2. Identify outliers with 1.5 × IQR rule
  3. Generate histogram with Freedman-Diaconis
  4. Check for skewness, multimodality

Example: Analyzing house prices, salary distributions.

Quality Control (Manufacturing)

Problem: Detect defective parts in production.

Approach:

  1. Measure dimensions of parts
  2. Compute Q1, Q3, IQR
  3. Set control limits at Q1 - 3×IQR and Q3 + 3×IQR
  4. Flag parts outside limits

Example: Bolt diameter tolerance, circuit board resistance.

Anomaly Detection (Security)

Problem: Find unusual login times or network traffic.

Approach:

  1. Compute median and IQR of normal behavior
  2. New observation outside Q3 + 1.5×IQR → alert
  3. Histogram shows temporal patterns (e.g., night-time access)

Example: Fraud detection, intrusion detection systems.

A/B Testing

Problem: Compare two groups (treatment vs control).

Approach:

  1. Compute five-number summary for both groups
  2. Compare medians (more robust than means)
  3. Check if distributions overlap using IQR
  4. Histogram shows distribution differences

Example: Website conversion rates, drug trial outcomes.

Toyota Way Principles

Muda (Waste Elimination)

QuickSelect for single quantiles: Avoids O(n log n) full sort when only one quantile is needed.

Benchmark (1M samples):

  • Full sort: 45 ms
  • QuickSelect: 0.8 ms
  • 56x speedup

Poka-Yoke (Error Prevention)

Outlier-resistant methods:

  • Freedman-Diaconis uses IQR (robust to outliers)
  • Median preferred over mean (robust to skew)

Example: Dataset with outlier (100x normal values):

  • Mean: biased by outlier
  • Median: unaffected
  • IQR-based bins: capture true distribution

Heijunka (Load Balancing)

Adaptive binning: Methods like Freedman-Diaconis adjust bin count to data characteristics, avoiding over/under-binning.

Best Practices

Quantile Computation

// ✅ Good: Single quantile with QuickSelect
let median = stats.quantile(0.5).unwrap();

// ✅ Good: Multiple quantiles with single sort
let percentiles = stats.percentiles(&[25.0, 50.0, 75.0, 90.0]).unwrap();

// ❌ Avoid: Multiple calls to quantile() (sorts each time)
let q1 = stats.quantile(0.25).unwrap();
let q2 = stats.quantile(0.50).unwrap();  // Sorts again!
let q3 = stats.quantile(0.75).unwrap();  // Sorts again!

Outlier Detection

// ✅ Conservative: 1.5 × IQR (flags ~0.7% of normal data)
let lower = q1 - 1.5 * iqr;
let upper = q3 + 1.5 * iqr;

// ✅ Strict: 3 × IQR (flags ~0.003% of normal data)
let lower_extreme = q1 - 3.0 * iqr;
let upper_extreme = q3 + 3.0 * iqr;

Histogram Method Selection

// Outliers present or skewed data
let hist = stats.histogram_method(BinMethod::FreedmanDiaconis).unwrap();

// Normal distribution, quick exploration
let hist = stats.histogram_method(BinMethod::Sturges).unwrap();

// Need statistical optimality (IMSE)
let hist = stats.histogram_method(BinMethod::Scott).unwrap();

Common Pitfalls

Using Mean Instead of Median

Problem: Mean is sensitive to outliers.

Example: Salaries [30K, 35K, 40K, 45K, 500K]

  • Mean: 130K (misleading, inflated by 500K)
  • Median: 40K (robust, represents typical salary)

Too Few Histogram Bins

Problem: Over-smoothing hides important features.

Solution: Use Freedman-Diaconis or Scott for adaptive binning.

Ignoring IQR for Spread

Problem: Standard deviation inflated by outliers.

Example: Response times [10ms, 12ms, 15ms, 20ms, 5000ms]

  • Stddev: ~1000ms (dominated by outlier)
  • IQR: 8ms (captures typical variation)

Further Reading

Quantile Methods:

  • Hyndman, R.J., Fan, Y. (1996). "Sample Quantiles in Statistical Packages"
  • Floyd, R.W., Rivest, R.L. (1975). "Algorithm 489: The Algorithm SELECT"

Histogram Binning:

  • Freedman, D., Diaconis, P. (1981). "On the Histogram as a Density Estimator"
  • Sturges, H.A. (1926). "The Choice of a Class Interval"
  • Scott, D.W. (1979). "On Optimal and Data-Based Histograms"

Outlier Detection:

  • Tukey, J.W. (1977). "Exploratory Data Analysis"

Summary

  • Quantiles: R-7 method with QuickSelect optimization (10-100x faster)
  • Five-number summary: Robust description using min, Q1, median, Q3, max
  • IQR: Outlier-resistant measure of spread (Q3 - Q1)
  • Histograms: Four binning methods (Freedman-Diaconis recommended for outliers)
  • Outlier detection: 1.5 × IQR rule (conservative) or 3 × IQR (strict)
  • Toyota Way: Eliminates waste (QuickSelect), prevents errors (IQR), adapts to data