Case Study: Descriptive Statistics

This case study demonstrates statistical analysis on test scores from a class of 30 students, using quantiles, five-number summaries, and histogram generation.

Overview

We'll analyze test scores (0-100 scale) to:

  • Understand class performance (quantiles, percentiles)
  • Identify struggling students (outlier detection)
  • Visualize distribution (histograms with different binning methods)
  • Make data-driven recommendations (pass rate, grade distribution)

Running the Example

cargo run --example descriptive_statistics

Expected output: Statistical analysis with quantiles, five-number summary, histogram comparisons, and summary statistics.

Dataset

Test Scores (30 students)

let test_scores = vec![
    45.0, // outlier (struggling student)
    52.0, // outlier
    62.0, 65.0, 68.0, 70.0, 72.0, 73.0, 75.0, 76.0, // lower cluster
    78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 84.0, 85.0, // middle cluster
    86.0, 87.0, 88.0, 89.0, 90.0, 91.0, 92.0, 93.0, // upper cluster
    95.0, 97.0, 98.0, // high performers
    100.0, // outlier (perfect score)
];

Distribution characteristics:

  • Most scores: 60-90 range (typical performance)
  • Low extremes: 45, 52 (struggling students; only 45 is flagged by the 1.5 × IQR rule in Analysis 2)
  • High extreme: 100 (perfect score; not flagged as an outlier in Analysis 2)
  • Sample size: 30 students

Creating the Statistics Object

use aprender::stats::{BinMethod, DescriptiveStats};
use trueno::Vector;

let data = Vector::from_slice(&test_scores);
let stats = DescriptiveStats::new(&data);

Analysis 1: Quantiles and Percentiles

Results

Key Quantiles:
  • 25th percentile (Q1): 73.5
  • 50th percentile (Median): 82.5
  • 75th percentile (Q3): 89.8

Percentile Distribution:
  • P10: 64.7 - Bottom 10% scored below this
  • P25: 73.5 - Bottom quartile
  • P50: 82.5 - Median score
  • P75: 89.8 - Top quartile
  • P90: 95.2 - Top 10% scored above this

Interpretation

Median (82.5): Half the class scored above 82.5, half below. The median is more robust than the mean (80.5) because the extreme scores (45, 52, 100) barely influence it.

Interquartile range (IQR = Q3 - Q1 = 89.75 - 73.5 = 16.25; the example output rounds Q3 to 89.8 and prints the IQR as 16.2):

  • Middle 50% of students scored between 73.5 and 89.8
  • This spread of roughly 16 points indicates moderate variability
  • Narrower IQR = more consistent performance
  • Wider IQR = more spread out scores

Percentile insights:

  • P10 (64.7): Bottom 10% struggling (below 65)
  • P90 (95.2): Top 10% excelling (above 95)
  • P50 (82.5): The median student scored in the low 80s, well above the pass mark

Why Median > Mean?

let mean = data.mean().unwrap();  // 80.53
let median = stats.quantile(0.5).unwrap();  // 82.5

Mean (80.53) is pulled down by lower outliers (45, 52).

Median (82.5) represents the "typical" student, unaffected by outliers.

Rule of thumb: Use median when data has outliers or is skewed.
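The rule of thumb can be checked numerically. The following is a minimal sketch that uses only the Rust standard library, not aprender's API; `mean`, `median`, and `robustness_demo` are hypothetical helpers written for this illustration, and scores are stored as `f64`. It compares both statistics before and after the lowest score is changed from 45 to 0.

```rust
/// Hypothetical helper: arithmetic mean (NaN for empty input).
fn mean(data: &[f64]) -> f64 {
    data.iter().sum::<f64>() / data.len() as f64
}

/// Hypothetical helper: median via a full sort (NaN for empty input).
fn median(data: &[f64]) -> f64 {
    if data.is_empty() {
        return f64::NAN;
    }
    let mut sorted = data.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    if n % 2 == 1 {
        sorted[n / 2]
    } else {
        // Even length: average the two middle values.
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    }
}

/// Returns (mean, median) for the original scores and for a copy
/// in which the lowest score (45) has been replaced by 0.
fn robustness_demo() -> ((f64, f64), (f64, f64)) {
    let scores = vec![
        45.0, 52.0, 62.0, 65.0, 68.0, 70.0, 72.0, 73.0, 75.0, 76.0,
        78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0,
        88.0, 89.0, 90.0, 91.0, 92.0, 93.0, 95.0, 97.0, 98.0, 100.0,
    ];
    let mut worse = scores.clone();
    worse[0] = 0.0; // the 45 becomes a 0
    ((mean(&scores), median(&scores)), (mean(&worse), median(&worse)))
}
```

In this sketch the mean drops by 1.5 points (80.53 to 79.03) while the median stays at 82.5.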

Analysis 2: Five-Number Summary (Outlier Detection)

Results

Five-Number Summary:
  • Minimum: 45.0
  • Q1 (25th percentile): 73.5
  • Median (50th percentile): 82.5
  • Q3 (75th percentile): 89.8
  • Maximum: 100.0

  • IQR (Q3 - Q1): 16.2

Outlier Fences (1.5 × IQR rule):
  • Lower fence: 49.1
  • Upper fence: 114.1
  • 1 outliers detected: [45.0]

Interpretation

1.5 × IQR Rule (Tukey's fences):

Lower fence = Q1 - 1.5 * IQR = 73.5 - 1.5 * 16.25 = 49.125 ≈ 49.1
Upper fence = Q3 + 1.5 * IQR = 89.75 + 1.5 * 16.25 = 114.125 ≈ 114.1

Outlier detection:

  • 45.0 < 49.1 → Outlier (struggling student)
  • 52.0 > 49.1 → Not an outlier (a failing score, but inside the fence)
  • 100.0 < 114.1 → Not an outlier (excellent but not anomalous)
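The fence calculation can be sketched in a few lines. This is an illustration using only the standard library, not aprender's implementation; `tukey_fences` and `outliers` are hypothetical names, and the quartiles (Q1 = 73.5, Q3 = 89.75) are taken from the results above.

```rust
/// Hypothetical helper: Tukey fences (lower, upper) for multiplier `k`.
fn tukey_fences(q1: f64, q3: f64, k: f64) -> (f64, f64) {
    let iqr = q3 - q1;
    (q1 - k * iqr, q3 + k * iqr)
}

/// Hypothetical helper: values strictly outside the fences.
fn outliers(data: &[f64], fences: (f64, f64)) -> Vec<f64> {
    data.iter()
        .copied()
        .filter(|&x| x < fences.0 || x > fences.1)
        .collect()
}
```

Calling `tukey_fences(73.5, 89.75, 1.5)` yields fences of 49.125 and 114.125, and `outliers` applied to the scores 45, 52, and 100 keeps only 45.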

Why is 100 not an outlier?

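The 1.5 × IQR rule is conservative (flags ~0.7% of normal data). Since the distribution has many high scores (90-98), a perfect 100 is within the expected range. The upper fence (114.1) also exceeds the maximum possible score, so no score on this exam could be flagged as a high outlier.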

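3 × IQR Rule (stricter; flags only extreme outliers):

Lower extreme = Q1 - 3 * IQR = 73.5 - 3 * 16.25 = 24.75
Upper extreme = Q3 + 3 * IQR = 89.75 + 3 * 16.25 = 138.5

Under this stricter rule no score is flagged: 45 lies above the lower extreme (24.75), so it is an outlier under the 1.5 × IQR rule but not an extreme one.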

Actionable Insights

For the instructor:

  • Student with 45: Needs immediate intervention (tutoring, office hours)
  • Students with 52-62: At risk, provide additional support
  • Students with 90-100: Consider advanced material or enrichment

For pass/fail threshold:

  • Setting threshold at 60: 28/30 pass (93.3% pass rate)
  • Setting threshold at 70: 25/30 pass (83.3% pass rate)
  • Current median (82.5) suggests most students mastered material
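The threshold counts above are simple to reproduce. The sketch below uses only the standard library; `pass_rate` and `test_scores` are hypothetical helpers introduced for illustration and are not part of aprender.

```rust
/// Hypothetical helper: number of scores at or above `threshold`,
/// and that share as a percentage (0.0 for empty input).
fn pass_rate(scores: &[f64], threshold: f64) -> (usize, f64) {
    let passed = scores.iter().filter(|&&s| s >= threshold).count();
    let pct = if scores.is_empty() {
        0.0
    } else {
        100.0 * passed as f64 / scores.len() as f64
    };
    (passed, pct)
}

/// The 30 test scores from the Dataset section.
fn test_scores() -> Vec<f64> {
    vec![
        45.0, 52.0, 62.0, 65.0, 68.0, 70.0, 72.0, 73.0, 75.0, 76.0,
        78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0,
        88.0, 89.0, 90.0, 91.0, 92.0, 93.0, 95.0, 97.0, 98.0, 100.0,
    ]
}
```

With these scores, `pass_rate(&test_scores(), 60.0)` counts 28 students and `pass_rate(&test_scores(), 70.0)` counts 25.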

Analysis 3: Histogram Binning Methods

Freedman-Diaconis Rule

📊 Freedman-Diaconis Rule:
   7 bins created
   [ 45.0 -  54.2):  2 ██████
   [ 54.2 -  63.3):  1 ███
   [ 63.3 -  72.5):  4 █████████████
   [ 72.5 -  81.7):  7 ███████████████████████
   [ 81.7 -  90.8):  9 ██████████████████████████████
   [ 90.8 - 100.0):  7 ███████████████████████

Formula:

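bin_width = 2 * IQR * n^(-1/3) = 2 * 16.25 * 30^(-1/3) ≈ 10.46
n_bins = ceil((100 - 45) / 10.46) = ceil(5.26) = 6

This matches the six intervals in the listing above. The "7 bins created" header is one higher, which suggests that it counts bin edges (seven edges delimit six bins).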
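As a cross-check of the formula, here is a minimal standard-library sketch. `freedman_diaconis` is a hypothetical helper, not aprender's function, and it takes the IQR and data range as inputs rather than computing them.

```rust
/// Hypothetical helper: Freedman-Diaconis bin width and bin count.
/// Returns None when the width is not a positive finite number
/// (for example an IQR of 0, or n = 0).
fn freedman_diaconis(iqr: f64, n: usize, min: f64, max: f64) -> Option<(f64, usize)> {
    let width = 2.0 * iqr / (n as f64).cbrt();
    if !width.is_finite() || width <= 0.0 {
        return None;
    }
    // At least one bin, even when all values are identical.
    let bins = ((max - min) / width).ceil().max(1.0) as usize;
    Some((width, bins))
}
```

For the case-study values (IQR = 16.25, n = 30, range 45 to 100) this evaluates to a width of about 10.46 and 6 bins, the same number of intervals as in the listing above.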

Interpretation:

  • Unimodal distribution: Single peak at [81.7 - 90.8) with 9 students
  • Lower tail: 2 students in [45 - 54.2) (struggling)
  • Balanced shoulders: 7 students each in [72.5 - 81.7) and [90.8 - 100), on either side of the peak

Best for: This dataset (outliers present, slightly skewed).

Sturges' Rule

📊 Sturges Rule:
   7 bins created
   [ 45.0 -  54.2):  2 ██████
   [ 54.2 -  63.3):  1 ███
   [ 63.3 -  72.5):  4 █████████████
   [ 72.5 -  81.7):  7 ███████████████████████
   [ 81.7 -  90.8):  9 ██████████████████████████████
   [ 90.8 - 100.0):  7 ███████████████████████

Formula:

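n_bins = ceil(log2(30)) + 1 = ceil(4.91) + 1 = 5 + 1 = 6

The listing shows these six intervals; the header's count of 7 appears to include the closing bin edge.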

Interpretation:

  • Same as Freedman-Diaconis for this dataset (coincidence)
  • Sturges assumes normal distribution (not quite true here)
  • Fast: O(1) computation (no IQR needed)

Best for: Quick exploration, normally distributed data.

Scott's Rule

📊 Scott Rule:
   5 bins created
   [ 45.0 -  58.8):  2 █████
   [ 58.8 -  72.5):  5 ████████████
   [ 72.5 -  86.2): 12 ██████████████████████████████
   [ 86.2 - 100.0): 11 ███████████████████████████

Formula:

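bin_width = 3.5 * σ * n^(-1/3) = 3.5 * 12.92 * 30^(-1/3) ≈ 14.6
n_bins = ceil((100 - 45) / 14.6) = ceil(3.78) = 4

The listing shows these four intervals; the header's count of 5 appears to include the closing bin edge.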
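Scott's rule can be sketched the same way. The code below uses only the standard library and assumes that σ is the population standard deviation (dividing by n), which reproduces the 12.92 reported in this case study; whether aprender uses the population or sample form is not stated here. `population_std_dev`, `scott`, and `test_scores` are hypothetical helpers.

```rust
/// Hypothetical helper: population standard deviation (divides by n).
fn population_std_dev(data: &[f64]) -> f64 {
    let n = data.len() as f64;
    let mean = data.iter().sum::<f64>() / n;
    (data.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n).sqrt()
}

/// Hypothetical helper: Scott's bin width (3.5 * sigma * n^(-1/3)) and bin count.
/// Returns None for empty data or a non-positive width.
fn scott(data: &[f64]) -> Option<(f64, usize)> {
    if data.is_empty() {
        return None;
    }
    let width = 3.5 * population_std_dev(data) / (data.len() as f64).cbrt();
    if width.is_nan() || width <= 0.0 {
        return None;
    }
    let min = data.iter().copied().fold(f64::INFINITY, f64::min);
    let max = data.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    Some((width, ((max - min) / width).ceil() as usize))
}

/// The 30 test scores from the Dataset section.
fn test_scores() -> Vec<f64> {
    vec![
        45.0, 52.0, 62.0, 65.0, 68.0, 70.0, 72.0, 73.0, 75.0, 76.0,
        78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0,
        88.0, 89.0, 90.0, 91.0, 92.0, 93.0, 95.0, 97.0, 98.0, 100.0,
    ]
}
```

On the 30 scores this gives σ ≈ 12.92, a width of about 14.56, and 4 bins, the same number of intervals as in the listing above.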

Interpretation:

  • Fewer bins (4 intervals vs 6) → smoother histogram
  • Still shows peak at [72.5 - 86.2) with 12 students
  • Less detail: Lower tail bins are wider

Best for: Near-normal distributions, minimizing integrated mean squared error (IMSE).

Square Root Rule

📊 Square Root Rule:
   7 bins created
   [ 45.0 -  54.2):  2 ██████
   [ 54.2 -  63.3):  1 ███
   [ 63.3 -  72.5):  4 █████████████
   [ 72.5 -  81.7):  7 ███████████████████████
   [ 81.7 -  90.8):  9 ██████████████████████████████
   [ 90.8 - 100.0):  7 ███████████████████████

Formula:

n_bins = ceil(sqrt(30)) = ceil(5.48) = 6

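Wait, why 7 bins?

  • The square root rule gives 6 bins, and the listing above does show 6 intervals
  • The reported count of 7 is one higher, as in the other listings, which suggests the header counts bin edges (7 edges delimit 6 bins) rather than a rounding difference
  • Rule of thumb: √n bins for quick exploration

Best for: Initial data exploration. The rule has no statistical basis.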

Comparison: Which Method to Use?

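Method              Bins   Best For
Freedman-Diaconis   6      This dataset (outliers, skewed)
Sturges             6      Quick exploration, normal data
Scott               4      Near-normal, smooth histogram
Square Root         6      Very quick initial look

Bins = number of intervals shown in each listing. The header line of each listing reports one more (7, 7, 5, 7), which appears to count bin edges.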

Recommendation: Use Freedman-Diaconis for most real-world datasets (outlier-resistant).
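To compare the methods side by side, the two count-based rules and the binning step itself can be sketched with the standard library alone. `sturges_bins`, `sqrt_bins`, `histogram_counts`, and `test_scores` are hypothetical helpers, and the equal-width binning with a closed last bin is an assumption chosen to reproduce the listings above, not a description of aprender's internals.

```rust
/// Sturges' rule: ceil(log2(n)) + 1 (intended for n >= 1).
fn sturges_bins(n: usize) -> usize {
    (n as f64).log2().ceil() as usize + 1
}

/// Square root rule: ceil(sqrt(n)).
fn sqrt_bins(n: usize) -> usize {
    (n as f64).sqrt().ceil() as usize
}

/// Hypothetical helper: counts per bin for `n_bins` equal-width bins over [min, max].
fn histogram_counts(data: &[f64], n_bins: usize) -> Vec<usize> {
    let mut counts = vec![0; n_bins];
    if data.is_empty() || n_bins == 0 {
        return counts;
    }
    let min = data.iter().copied().fold(f64::INFINITY, f64::min);
    let max = data.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let width = (max - min) / n_bins as f64;
    for &x in data {
        let idx = if width > 0.0 {
            // Clamp so the maximum value falls into the last bin.
            (((x - min) / width) as usize).min(n_bins - 1)
        } else {
            0
        };
        counts[idx] += 1;
    }
    counts
}

/// The 30 test scores from the Dataset section.
fn test_scores() -> Vec<f64> {
    vec![
        45.0, 52.0, 62.0, 65.0, 68.0, 70.0, 72.0, 73.0, 75.0, 76.0,
        78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0,
        88.0, 89.0, 90.0, 91.0, 92.0, 93.0, 95.0, 97.0, 98.0, 100.0,
    ]
}
```

With 6 bins this sketch reproduces the counts 2, 1, 4, 7, 9, 7 from the Freedman-Diaconis, Sturges, and Square Root listings; with 4 bins it reproduces the 2, 5, 12, 11 of the Scott listing.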

Analysis 4: Summary Statistics

Results

Dataset Statistics:
  • Sample size: 30
  • Mean: 80.53
  • Std Dev: 12.92
  • Range: [45.0, 100.0]
  • Median: 82.5
  • IQR: 16.2

Class Performance:
  • Pass rate (≥60): 93.3% (28/30)
  • A grade rate (≥90): 26.7% (8/30)

Interpretation

Mean vs Median:

  • Mean (80.53) < Median (82.5) → Left-skewed distribution
  • Outliers (45, 52) pull mean down
  • Median better represents "typical" student

Standard deviation (12.92):

  • Moderate spread (12.9 points)
  • Most students within ±1σ: [67.6, 93.5] contains 22 of the 30 scores (73%; a normal distribution would give about 68%)
  • Compare to IQR (16.25): Similar scale
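How many scores actually fall within one standard deviation of the mean can be counted directly. The sketch below uses only the standard library; `coverage` and `test_scores` are hypothetical helpers, and the mean (80.53) and standard deviation (12.92) are the values reported above.

```rust
/// Hypothetical helper: how many values lie within `center ± half_width`
/// (inclusive), and that share as a percentage (0.0 for empty input).
fn coverage(data: &[f64], center: f64, half_width: f64) -> (usize, f64) {
    let (lo, hi) = (center - half_width, center + half_width);
    let inside = data.iter().filter(|&&x| x >= lo && x <= hi).count();
    let pct = if data.is_empty() {
        0.0
    } else {
        100.0 * inside as f64 / data.len() as f64
    };
    (inside, pct)
}

/// The 30 test scores from the Dataset section.
fn test_scores() -> Vec<f64> {
    vec![
        45.0, 52.0, 62.0, 65.0, 68.0, 70.0, 72.0, 73.0, 75.0, 76.0,
        78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0,
        88.0, 89.0, 90.0, 91.0, 92.0, 93.0, 95.0, 97.0, 98.0, 100.0,
    ]
}
```

For this dataset `coverage(&test_scores(), 80.53, 12.92)` counts 22 scores, about 73% of the class, somewhat more than the 68% a normal distribution would predict.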

Pass rate (93.3%):

  • 28 out of 30 students passed (≥60)
  • Only 2 students failed (45, 52)
  • Strong overall performance

A grade rate (26.7%):

  • 8 out of 30 students earned A (≥90)
  • Top quartile (Q3 = 89.8) almost reaches A threshold
  • Challenging exam, but achievable

Recommendations

For struggling students (45, 52):

  • One-on-one tutoring sessions
  • Review fundamental concepts
  • Consider alternative assessment methods

For at-risk students (60-70):

  • Group study sessions
  • Office hours attendance
  • Practice problem sets

For high performers (≥90):

  • Advanced topics or projects
  • Peer tutoring opportunities
  • Enrichment material

Performance Notes

QuickSelect Optimization

// Single quantile: O(n) with QuickSelect
let median = stats.quantile(0.5).unwrap();

// Multiple quantiles: O(n log n) with single sort
let percentiles = stats.percentiles(&[25.0, 50.0, 75.0]).unwrap();

Benchmark (1M samples):

  • Full sort: 45 ms
  • QuickSelect (single quantile): 0.8 ms
  • 56x speedup

For this 30-sample dataset the difference is negligible (<1 μs), but the advantage grows with dataset size.
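The idea behind selection-based quantiles can be illustrated with the standard library's `select_nth_unstable_by`, which partially orders a slice in O(n) on average. This is a sketch of the general technique, not aprender's QuickSelect implementation; `median_by_selection` is a hypothetical helper.

```rust
/// Hypothetical helper: median via selection instead of a full sort.
/// Returns None for empty input.
fn median_by_selection(data: &[f64]) -> Option<f64> {
    if data.is_empty() {
        return None;
    }
    let mut buf = data.to_vec();
    let n = buf.len();
    let mid = n / 2;
    // After this call, buf[mid] holds the value a full sort would place there;
    // everything before it is <= that value, everything after is >= it.
    let (lower, upper_mid, _) = buf.select_nth_unstable_by(mid, |a, b| a.total_cmp(b));
    let upper_mid = *upper_mid;
    if n % 2 == 1 {
        Some(upper_mid)
    } else {
        // For even n, the other middle value is the maximum of the left partition.
        let lower_mid = lower.iter().copied().fold(f64::NEG_INFINITY, f64::max);
        Some((lower_mid + upper_mid) / 2.0)
    }
}
```

Only the element at the middle index is placed exactly; the rest of the buffer stays unsorted, which is where the savings over a full sort come from.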

R-7 Interpolation

Aprender uses the R-7 method for quantiles:

h = (n - 1) * q = (30 - 1) * 0.5 = 14.5
Q(0.5) = data[14] + 0.5 * (data[15] - data[14])
       = 82.0 + 0.5 * (83.0 - 82.0) = 82.5

This matches the default quantile method in R, NumPy, and Pandas.
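The interpolation above generalizes to any q in [0, 1]. Here is a minimal standard-library sketch of R-7; `quantile_r7` and `test_scores` are hypothetical helpers, the sketch sorts the data for simplicity, and it is not aprender's implementation.

```rust
/// Hypothetical helper: R-7 (linear interpolation) quantile for q in [0, 1].
/// Returns None for empty input or q outside [0, 1] (including NaN).
fn quantile_r7(data: &[f64], q: f64) -> Option<f64> {
    if data.is_empty() || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let mut sorted = data.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // h is the fractional index into the sorted data.
    let h = (sorted.len() - 1) as f64 * q;
    let lo = h.floor() as usize;
    let hi = h.ceil() as usize;
    Some(sorted[lo] + (h - lo as f64) * (sorted[hi] - sorted[lo]))
}

/// The 30 test scores from the Dataset section.
fn test_scores() -> Vec<f64> {
    vec![
        45.0, 52.0, 62.0, 65.0, 68.0, 70.0, 72.0, 73.0, 75.0, 76.0,
        78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0,
        88.0, 89.0, 90.0, 91.0, 92.0, 93.0, 95.0, 97.0, 98.0, 100.0,
    ]
}
```

Applied to the 30 scores, q = 0.25, 0.5, and 0.75 give 73.5, 82.5, and 89.75, in line with the quartiles reported in Analysis 1.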

Real-World Applications

Educational Assessment

Problem: Identify struggling students early.

Approach:

  1. Compute percentiles after first exam
  2. Students below P25 → at-risk
  3. Students below P10 → immediate intervention
  4. Monitor progress over semester

Example: This case study (P10 = 64.7, flag students below 65).
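The percentile-based triage in the approach above could be sketched as follows. The `Tier` enum and the `tier` and `tier_counts` helpers are hypothetical names introduced for illustration; the cut-offs are passed in, so the sketch works with the P10 and P25 values from whichever quantile routine is used.

```rust
/// Hypothetical support tiers derived from percentile cut-offs.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Tier {
    Immediate, // below P10
    AtRisk,    // below P25 but not below P10
    OnTrack,   // at or above P25
}

/// Hypothetical helper: classify one score against the P10 and P25 cut-offs.
fn tier(score: f64, p10: f64, p25: f64) -> Tier {
    if score < p10 {
        Tier::Immediate
    } else if score < p25 {
        Tier::AtRisk
    } else {
        Tier::OnTrack
    }
}

/// Hypothetical helper: (immediate, at-risk, on-track) counts for a class.
fn tier_counts(scores: &[f64], p10: f64, p25: f64) -> (usize, usize, usize) {
    let mut counts = (0, 0, 0);
    for &s in scores {
        match tier(s, p10, p25) {
            Tier::Immediate => counts.0 += 1,
            Tier::AtRisk => counts.1 += 1,
            Tier::OnTrack => counts.2 += 1,
        }
    }
    counts
}
```

With the case-study cut-offs (P10 = 64.7, P25 = 73.5), the 30 scores split into 3 students for immediate intervention (45, 52, 62), 5 at risk (65 to 73), and 22 on track.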

Employee Performance Reviews

Problem: Calibrate ratings across managers.

Approach:

  1. Compute five-number summary for each manager's ratings
  2. Compare medians (detect leniency/strictness bias)
  3. Use IQR to compare rating consistency
  4. Normalize to company-wide distribution

Example: Manager A median = 3.5/5, Manager B median = 4.5/5 → bias detected.

Quality Control (Manufacturing)

Problem: Detect defective batches.

Approach:

  1. Measure part dimensions (e.g., bolt diameter)
  2. Compute Q1, Q3, IQR for normal production
  3. Set control limits at Q1 - 3×IQR and Q3 + 3×IQR
  4. Flag parts outside limits as defects

Example: Bolt diameter target = 10mm with Q1 = 9.975mm and Q3 = 10.025mm (IQR = 0.05mm), limits = [9.825mm, 10.175mm].

A/B Testing (Web Analytics)

Problem: Compare two website designs.

Approach:

  1. Collect conversion rates for both versions
  2. Compare medians (more robust than means)
  3. Check if distributions overlap using IQR
  4. Use histogram to visualize differences

Example: Version A median = 3.2% conversion, Version B median = 3.8% conversion.

Toyota Way Principles in Action

Muda (Waste Elimination)

QuickSelect avoids unnecessary sorting:

  • Single quantile: No need to sort entire array
  • O(n) vs O(n log n) → 10-100x speedup on large datasets

Poka-Yoke (Error Prevention)

IQR-based methods resist outliers:

  • Freedman-Diaconis uses IQR (not σ)
  • Five-number summary uses quartiles (not mean/stddev)
  • Median unaffected by extreme values

Example: Dataset [10, 12, 15, 20, 5000]

  • Mean: ~1011 (dominated by outlier)
  • Median: 15 (robust)
  • IQR: 8 (Q1 = 12, Q3 = 20), so the Freedman-Diaconis bin width is 2 × 8 × 5^(-1/3) ≈ 9.4 (tracks the spread of the bulk of the data)

Heijunka (Load Balancing)

Adaptive binning adjusts to data:

  • Freedman-Diaconis: Bin width = 2 × IQR × n^(-1/3), so bins widen for spread-out data (high IQR)
  • Bins narrow for tightly clustered data (low IQR) and for larger samples
  • No manual tuning required

Exercises

  1. Change pass threshold: Set passing = 70. How many students pass? (25/30 = 83.3%)

  2. Remove outliers: Remove 45 and 52. Recompute:

    • Mean (should increase to ~83)
    • Median (should shift only slightly, from 82.5 to 83.5)
    • IQR (should decrease slightly, to about 14.5)

  3. Add more data: Simulate 100 students with rand_distr::Normal (the Normal distribution moved out of rand::distributions in rand 0.7). Compare:

    • Freedman-Diaconis vs Sturges bin counts
    • Median vs mean (should be closer for normal data)

  4. Compare binning methods: Which histogram best shows:

    • The struggling students? (Freedman-Diaconis, narrower bins)
    • Overall distribution shape? (Scott, wider bins, smoother)

Further Reading

  • Quantile Methods: Hyndman, R.J., Fan, Y. (1996). "Sample Quantiles in Statistical Packages"
  • Histogram Binning: Freedman, D., Diaconis, P. (1981). "On the Histogram as a Density Estimator"
  • Outlier Detection: Tukey, J.W. (1977). "Exploratory Data Analysis"
  • QuickSelect: Floyd, R.W., Rivest, R.L. (1975). "Algorithm 489: The Algorithm SELECT"

Summary

  • Quantiles: Median (82.5) better than mean (80.5) for skewed data
  • Five-number summary: Robust description (min, Q1, median, Q3, max)
  • IQR (16.25): Measures spread, resistant to outliers
  • Outlier detection: 1.5 × IQR rule identified 1 struggling student (45.0)
  • Histograms: Freedman-Diaconis recommended (outlier-resistant, adaptive)
  • Performance: QuickSelect (10-100x faster for single quantiles)
  • Applications: Education, HR, manufacturing, A/B testing

Run the example yourself:

cargo run --example descriptive_statistics