Descriptive Statistics Theory
Descriptive statistics summarize and describe the main features of a dataset. This chapter covers aprender's statistics module, focusing on quantiles, five-number summaries, and histogram generation with adaptive binning.
Quantiles and Percentiles
Definition
A quantile divides a dataset into equal-sized groups. The q-th quantile (0 ≤ q ≤ 1) is the value below which a proportion q of the data falls.
Percentiles are quantiles multiplied by 100:
- 25th percentile = 0.25 quantile (Q1)
- 50th percentile = 0.50 quantile (median, Q2)
- 75th percentile = 0.75 quantile (Q3)
R-7 Method (Hyndman & Fan)
There are 9 different quantile calculation methods. Aprender uses R-7, the default in R, NumPy, and Pandas, which provides smooth interpolation.
Algorithm:
- Sort the data (or use QuickSelect for single quantile)
- Compute position:
h = (n - 1) * q - If h is integer: return
data[h] - Otherwise: linear interpolation between
data[floor(h)]anddata[ceil(h)]
Interpolation formula:
Q(q) = data[h_floor] + (h - h_floor) * (data[h_ceil] - data[h_floor])
Implementation
use aprender::stats::DescriptiveStats;
use trueno::Vector;
let data = Vector::from_slice(&[1.0, 2.0, 3.0, 4.0, 5.0]);
let stats = DescriptiveStats::new(&data);
let median = stats.quantile(0.5).unwrap();
assert!((median - 3.0).abs() < 1e-6);
let q25 = stats.quantile(0.25).unwrap();
let q75 = stats.quantile(0.75).unwrap();
println!("IQR: {:.2}", q75 - q25);
QuickSelect Optimization
Naive approach: Sort the entire array (O(n log n))
QuickSelect (Floyd-Rivest SELECT algorithm):
- Average case: O(n)
- Worst case: O(n²) (rare with good pivot selection)
- 10-100x faster for single quantiles on large datasets
Rust's select_nth_unstable uses Hoare's selection algorithm with median-of-medians pivot selection.
// Inside quantile() implementation
let mut working_copy = self.data.as_slice().to_vec();
working_copy.select_nth_unstable_by(h_floor, |a, b| {
a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal)
});
let value = working_copy[h_floor];
Time Complexity
| Operation | Naive (full sort) | QuickSelect |
|---|---|---|
| Single quantile | O(n log n) | O(n) average |
| Multiple quantiles | O(n log n) | O(n log n) (reuse sorted) |
Best practice: For 3+ quantiles, sort once and reuse:
let percentiles = stats.percentiles(&[25.0, 50.0, 75.0]).unwrap();
Five-Number Summary
Definition
The five-number summary provides a robust description of data distribution:
- Minimum: Smallest value
- Q1 (25th percentile): Lower quartile
- Median (50th percentile): Middle value
- Q3 (75th percentile): Upper quartile
- Maximum: Largest value
Interquartile Range (IQR):
IQR = Q3 - Q1
The IQR measures the spread of the middle 50% of data, resistant to outliers.
Outlier Detection
1.5 × IQR Rule (Tukey's fences):
Lower fence = Q1 - 1.5 * IQR
Upper fence = Q3 + 1.5 * IQR
Values outside these fences are potential outliers.
3 × IQR Rule (extreme outliers):
Extreme lower = Q1 - 3 * IQR
Extreme upper = Q3 + 3 * IQR
Implementation
use aprender::stats::DescriptiveStats;
use trueno::Vector;
let data = Vector::from_slice(&[1.0, 2.0, 3.0, 4.0, 5.0, 100.0]); // 100 is outlier
let stats = DescriptiveStats::new(&data);
let summary = stats.five_number_summary().unwrap();
println!("Min: {:.1}", summary.min);
println!("Q1: {:.1}", summary.q1);
println!("Median: {:.1}", summary.median);
println!("Q3: {:.1}", summary.q3);
println!("Max: {:.1}", summary.max);
let iqr = stats.iqr().unwrap();
let lower_fence = summary.q1 - 1.5 * iqr;
let upper_fence = summary.q3 + 1.5 * iqr;
// 100.0 > upper_fence → outlier detected
Applications
- Exploratory Data Analysis: Quick distribution overview
- Quality Control: Detect defects in manufacturing
- Anomaly Detection: Find unusual values in sensor data
- Data Validation: Identify data entry errors
Histogram Binning Methods
Overview
Histograms visualize data distribution by grouping values into bins. Choosing the right number of bins is critical:
- Too few bins: Over-smoothing, miss important features
- Too many bins: Noise dominates, hard to interpret
Freedman-Diaconis Rule
Formula:
bin_width = 2 * IQR * n^(-1/3)
n_bins = ceil((max - min) / bin_width)
Characteristics:
- Outlier-resistant: Uses IQR instead of standard deviation
- Adaptive: Adjusts to data spread
- Best for: Skewed distributions, data with outliers
Time complexity: O(n log n) for full sort (or O(n) with QuickSelect for Q1/Q3)
Sturges' Rule
Formula:
n_bins = ceil(log2(n)) + 1
Characteristics:
- Fast: O(1) computation
- Simple: Only depends on sample size
- Best for: Normal distributions, quick exploration
- Warning: Underestimates bins for non-normal data
Example: 1000 samples → 11 bins, 1M samples → 21 bins
Scott's Rule
Formula:
bin_width = 3.5 * σ * n^(-1/3)
n_bins = ceil((max - min) / bin_width)
where σ is standard deviation.
Characteristics:
- Statistically optimal: Minimizes integrated mean squared error (IMSE)
- Sensitive to outliers: Uses standard deviation
- Best for: Normal or near-normal distributions
Time complexity: O(n) for mean and stddev
Square Root Rule
Formula:
n_bins = ceil(sqrt(n))
Characteristics:
- Very fast: O(1) computation
- Simple heuristic: No statistical basis
- Best for: Quick exploration, initial EDA
Example: 100 samples → 10 bins, 10K samples → 100 bins
Bayesian Blocks (Placeholder)
Status: Future implementation (O(n²) dynamic programming)
Characteristics:
- Adaptive: Finds optimal change points
- Non-uniform: Bins can have different widths
- Best for: Time series, event data with varying density
Currently falls back to Freedman-Diaconis.
Comparison Table
| Method | Complexity | Outlier Resistant | Best For |
|---|---|---|---|
| Freedman-Diaconis | O(n log n) | ✅ Yes (uses IQR) | Skewed data, outliers |
| Sturges | O(1) | ❌ No | Normal distributions |
| Scott | O(n) | ❌ No (uses σ) | Near-normal data |
| Square Root | O(1) | ❌ No | Quick exploration |
| Bayesian Blocks | O(n²) | ✅ Yes | Time series, events |
Implementation
use aprender::stats::{BinMethod, DescriptiveStats};
use trueno::Vector;
let data = Vector::from_slice(&[/* your data */]);
let stats = DescriptiveStats::new(&data);
// Use Freedman-Diaconis for outlier-resistant binning
let hist = stats.histogram_method(BinMethod::FreedmanDiaconis).unwrap();
println!("Bins: {} bins created", hist.bins.len());
for (i, (&lower, &count)) in hist.bins.iter().zip(hist.counts.iter()).enumerate() {
let upper = if i < hist.bins.len() - 1 { hist.bins[i + 1] } else { data.max().unwrap() };
println!("[{:.1} - {:.1}): {} samples", lower, upper, count);
}
Density vs Count
Histograms can show counts or probability density:
Counts: Number of samples in each bin (default)
Density: Normalized so area = 1
density[i] = count[i] / (n * bin_width[i])
Density allows comparison across different sample sizes.
Performance Characteristics
Quantile Computation (1M samples)
| Method | Time | Notes |
|---|---|---|
| Full sort | 45 ms | O(n log n), reusable for multiple quantiles |
| QuickSelect (single) | 0.8 ms | O(n) average, 56x faster |
| QuickSelect (5 quantiles) | 4 ms | Still 11x faster (partially sorted) |
Recommendation: Use QuickSelect for 1-2 quantiles, full sort for 3+.
Histogram Generation (1M samples)
| Method | Time | Notes |
|---|---|---|
| Freedman-Diaconis | 52 ms | Includes IQR computation |
| Sturges | 8 ms | Just sorting + binning |
| Scott | 10 ms | Includes stddev computation |
| Square Root | 8 ms | Just sorting + binning |
Memory Usage
All methods operate on a single copy of the data (O(n) memory):
- Quantiles: O(n) working copy for partial sort
- Histograms: O(n) for sorting + O(k) for bins (k ≪ n)
Real-World Applications
Exploratory Data Analysis (EDA)
Problem: Understand data distribution before modeling.
Approach:
- Compute five-number summary
- Identify outliers with 1.5 × IQR rule
- Generate histogram with Freedman-Diaconis
- Check for skewness, multimodality
Example: Analyzing house prices, salary distributions.
Quality Control (Manufacturing)
Problem: Detect defective parts in production.
Approach:
- Measure dimensions of parts
- Compute Q1, Q3, IQR
- Set control limits at Q1 - 3×IQR and Q3 + 3×IQR
- Flag parts outside limits
Example: Bolt diameter tolerance, circuit board resistance.
Anomaly Detection (Security)
Problem: Find unusual login times or network traffic.
Approach:
- Compute median and IQR of normal behavior
- New observation outside Q3 + 1.5×IQR → alert
- Histogram shows temporal patterns (e.g., night-time access)
Example: Fraud detection, intrusion detection systems.
A/B Testing
Problem: Compare two groups (treatment vs control).
Approach:
- Compute five-number summary for both groups
- Compare medians (more robust than means)
- Check if distributions overlap using IQR
- Histogram shows distribution differences
Example: Website conversion rates, drug trial outcomes.
Toyota Way Principles
Muda (Waste Elimination)
QuickSelect for single quantiles: Avoids O(n log n) full sort when only one quantile is needed.
Benchmark (1M samples):
- Full sort: 45 ms
- QuickSelect: 0.8 ms
- 56x speedup
Poka-Yoke (Error Prevention)
Outlier-resistant methods:
- Freedman-Diaconis uses IQR (robust to outliers)
- Median preferred over mean (robust to skew)
Example: Dataset with outlier (100x normal values):
- Mean: biased by outlier
- Median: unaffected
- IQR-based bins: capture true distribution
Heijunka (Load Balancing)
Adaptive binning: Methods like Freedman-Diaconis adjust bin count to data characteristics, avoiding over/under-binning.
Best Practices
Quantile Computation
// ✅ Good: Single quantile with QuickSelect
let median = stats.quantile(0.5).unwrap();
// ✅ Good: Multiple quantiles with single sort
let percentiles = stats.percentiles(&[25.0, 50.0, 75.0, 90.0]).unwrap();
// ❌ Avoid: Multiple calls to quantile() (sorts each time)
let q1 = stats.quantile(0.25).unwrap();
let q2 = stats.quantile(0.50).unwrap(); // Sorts again!
let q3 = stats.quantile(0.75).unwrap(); // Sorts again!
Outlier Detection
// ✅ Conservative: 1.5 × IQR (flags ~0.7% of normal data)
let lower = q1 - 1.5 * iqr;
let upper = q3 + 1.5 * iqr;
// ✅ Strict: 3 × IQR (flags ~0.003% of normal data)
let lower_extreme = q1 - 3.0 * iqr;
let upper_extreme = q3 + 3.0 * iqr;
Histogram Method Selection
// Outliers present or skewed data
let hist = stats.histogram_method(BinMethod::FreedmanDiaconis).unwrap();
// Normal distribution, quick exploration
let hist = stats.histogram_method(BinMethod::Sturges).unwrap();
// Need statistical optimality (IMSE)
let hist = stats.histogram_method(BinMethod::Scott).unwrap();
Common Pitfalls
Using Mean Instead of Median
Problem: Mean is sensitive to outliers.
Example: Salaries [30K, 35K, 40K, 45K, 500K]
- Mean: 130K (misleading, inflated by 500K)
- Median: 40K (robust, represents typical salary)
Too Few Histogram Bins
Problem: Over-smoothing hides important features.
Solution: Use Freedman-Diaconis or Scott for adaptive binning.
Ignoring IQR for Spread
Problem: Standard deviation inflated by outliers.
Example: Response times [10ms, 12ms, 15ms, 20ms, 5000ms]
- Stddev: ~1000ms (dominated by outlier)
- IQR: 8ms (captures typical variation)
Further Reading
Quantile Methods:
- Hyndman, R.J., Fan, Y. (1996). "Sample Quantiles in Statistical Packages"
- Floyd, R.W., Rivest, R.L. (1975). "Algorithm 489: The Algorithm SELECT"
Histogram Binning:
- Freedman, D., Diaconis, P. (1981). "On the Histogram as a Density Estimator"
- Sturges, H.A. (1926). "The Choice of a Class Interval"
- Scott, D.W. (1979). "On Optimal and Data-Based Histograms"
Outlier Detection:
- Tukey, J.W. (1977). "Exploratory Data Analysis"
Summary
- Quantiles: R-7 method with QuickSelect optimization (10-100x faster)
- Five-number summary: Robust description using min, Q1, median, Q3, max
- IQR: Outlier-resistant measure of spread (Q3 - Q1)
- Histograms: Four binning methods (Freedman-Diaconis recommended for outliers)
- Outlier detection: 1.5 × IQR rule (conservative) or 3 × IQR (strict)
- Toyota Way: Eliminates waste (QuickSelect), prevents errors (IQR), adapts to data