Chapter 37: Popper Falsifiability Score

The pmat popper-score command evaluates a project’s scientific rigor against Karl Popper’s falsifiability criterion, producing a 100-point score that measures whether the project’s claims are empirically testable.

Overview

The Popper Falsifiability Score implements Popper’s demarcation criterion, introduced in The Logic of Scientific Discovery (1959) and stated memorably in Conjectures and Refutations (1963):

“A theory that is not refutable by any conceivable event is non-scientific. Irrefutability is not a virtue of a theory (as people often think) but a vice.”

This score helps teams:

  • Quantify scientific rigor of software claims
  • Identify unfalsifiable statements that cannot be verified
  • Enforce reproducibility standards
  • Improve empirical evidence in documentation

The Falsifiability Gateway

Critical Concept: If Category A (Falsifiability & Testability) scores below 60%, the total score is automatically 0.

This implements Popper’s key insight: without falsifiable claims, no other quality metric matters. A project that makes unfalsifiable claims (like “fastest API ever” without benchmarks) cannot be scientifically validated.

┌─────────────────────────────────────────────────────────┐
│              FALSIFIABILITY GATEWAY                     │
│                                                         │
│   Category A >= 60%?                                    │
│         │                                               │
│    ┌────┴────┐                                          │
│   YES       NO                                          │
│    │         │                                          │
│    ▼         ▼                                          │
│  Score    Score = 0                                     │
│  calculated  "Gateway Failed"                           │
└─────────────────────────────────────────────────────────┘
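
In code, the gateway is a simple short-circuit. The sketch below illustrates the rule only; the function and its inputs are hypothetical, not pmat’s internal implementation:

// Illustration of the gateway rule (hypothetical helper, not pmat's code).
// category_a_pct: Category A's earned points as a percentage of its maximum.
// raw_points: the sum of all category scores before the gateway is applied.
fn total_score(category_a_pct: f64, raw_points: f64) -> f64 {
    if category_a_pct < 60.0 {
        0.0 // "Gateway Failed": no other metric can compensate
    } else {
        raw_points
    }
}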

Score Categories (100 Points Total)

Category A: Falsifiability & Testability (25 points) - GATEWAY

This category gates the entire score.

| Sub-Score | Points | Criteria |
|-----------|--------|----------|
| A1: Explicit Falsifiable Claims | 8 | README contains measurable success/failure criteria |
| A2: Test Coverage as Evidence | 10 | Tests exist, coverage measured, CI runs them |
| A3: Confidence Intervals | 7 | Statistical uncertainty quantified |

What Makes a Claim Falsifiable?

# UNFALSIFIABLE (❌)
"This library is extremely fast and efficient."
"Our API is robust and reliable."
"Handles all edge cases gracefully."

# FALSIFIABLE (✅)
"Response time < 100ms for 99th percentile (measured via Criterion)"
"Zero memory leaks verified with Valgrind on test suite"
"Handles inputs up to 10GB without OOM (tested in CI)"

Category B: Reproducibility Infrastructure (25 points)

| Sub-Score | Points | Criteria |
|-----------|--------|----------|
| B1: Dependency Pinning | 10 | Cargo.lock/package-lock.json committed |
| B2: Containerization | 8 | Dockerfile or Nix flake present |
| B3: Build Reproducibility | 7 | Makefile, clear build instructions |

Category C: Transparency & Openness (20 points)

| Sub-Score | Points | Criteria |
|-----------|--------|----------|
| C1: License | 5 | OSI-approved license present |
| C2: Documentation | 8 | Comprehensive README, API docs, CHANGELOG |
| C3: Design Decisions | 7 | ADRs, CONTRIBUTING guide |

Category D: Statistical Rigor (15 points)

| Sub-Score | Points | Criteria |
|-----------|--------|----------|
| D1: Sample Documentation | 5 | Data sources documented |
| D2: Effect Size Reporting | 5 | Confidence intervals, not just p-values |
| D3: Comparison Baselines | 5 | Benchmarks compare against alternatives |

Category E: Historical Integrity (10 points)

| Sub-Score | Points | Criteria |
|-----------|--------|----------|
| E1: Version Control | 4 | Git repository with history |
| E2: Pre-registration | 3 | Roadmap/design docs before implementation |
| E3: Change Documentation | 3 | CHANGELOG with semantic versioning |

Category F: ML/AI Reproducibility (5 points or N/A)

Only applies to ML/AI projects. For non-ML projects, this category is excluded from scoring.

| Sub-Score | Points | Criteria |
|-----------|--------|----------|
| F1: Random Seed Management | 2 | Seeds documented and reproducible |
| F2: Model Artifacts | 2 | Weights/checkpoints version-controlled |
| F3: Dataset Documentation | 1 | Training data sources documented |

N/A Handling: For non-ML projects, the denominator excludes Category F:

  • ML project: Score out of 100 points
  • Non-ML project: Score out of 95 points (normalized to 100%)
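
Expressed as a formula: normalized = earned / applicable_max × 100. A minimal sketch of the rule (hypothetical helper):

// Normalize earned points against the applicable maximum:
// 100 when Category F applies (ML project), 95 when it is N/A.
fn normalized_score(earned: f64, is_ml_project: bool) -> f64 {
    let max = if is_ml_project { 100.0 } else { 95.0 };
    earned / max * 100.0
}

// normalized_score(70.5, false) ≈ 74.2, matching the example later in this chapter.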

Grading System

| Grade | Score Range | Interpretation |
|-------|-------------|----------------|
| A+ | 95-100% | Exceptional scientific rigor |
| A | 85-94% | Meets PMAT standards |
| B | 70-84% | Good, minor gaps |
| C | 55-69% | Acceptable, improvement needed |
| D | 40-54% | Below standard |
| F | 1-39% | Failing |
| GATEWAY FAILED | 0% | Category A < 60% |
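
The same boundaries as code, assuming the gateway check has already run (a gateway failure short-circuits to 0 before any grade is assigned); the helper is a sketch:

// Map a normalized percentage to a letter grade. Mirrors the table above.
fn grade(normalized_pct: f64) -> &'static str {
    match normalized_pct {
        p if p >= 95.0 => "A+",
        p if p >= 85.0 => "A",
        p if p >= 70.0 => "B",
        p if p >= 55.0 => "C",
        p if p >= 40.0 => "D",
        _ => "F",
    }
}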

Usage

Basic Usage

# Score current project
pmat popper-score

# Score specific path
pmat popper-score --path /path/to/project

Output Formats

# Text (default, terminal)
pmat popper-score

# JSON (CI/CD integration)
pmat popper-score --format json

# Markdown (documentation)
pmat popper-score --format markdown --output SCORE.md

# YAML (config files)
pmat popper-score --format yaml
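
The JSON output is the natural target for downstream tooling. Only two fields are exercised by the CI examples later in this chapter (gateway_passed and normalized_score), so the sketch below models just those and makes no assumption about the rest of the schema:

use serde::Deserialize;

// Partial view of the JSON report; unknown fields are ignored by serde.
#[derive(Deserialize)]
struct PopperReport {
    gateway_passed: bool,
    normalized_score: f64,
}

fn main() -> anyhow::Result<()> {
    // Produced by: pmat popper-score --format json > popper-score.json
    let raw = std::fs::read_to_string("popper-score.json")?;
    let report: PopperReport = serde_json::from_str(&raw)?;
    anyhow::ensure!(report.gateway_passed, "falsifiability gateway failed");
    println!("Normalized score: {:.1}%", report.normalized_score);
    Ok(())
}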

Verbose Mode

# Show detailed breakdown of all sub-scores
pmat popper-score --verbose

Failures Only

# Show only failing checks and recommendations
pmat popper-score --failures-only

Command Aliases

# All equivalent:
pmat popper-score
pmat popper
pmat falsifiability

Complete Example

$ pmat popper-score --verbose

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🔬  Popper Falsifiability Score v1.1.0
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

✅  Gateway: PASSED (Falsifiability >= 60%)

📌  Summary
  Score: 70.5/95
  Normalized: 74.2%
  Grade: B

📂  Categories
  ✅ A. Falsifiability & Testability: 18.0/25 (72.0%) [GATEWAY]
    ✓ A1: 6.0/8 - explicit claims found, measurable thresholds found
    ✓ A2: 8.0/10 - test files exist, coverage config found, CI runs tests
    ~ A3: 4.0/7 - confidence intervals mentioned
  ⚠️ B. Reproducibility Infrastructure: 17.0/25 (68.0%)
    ✓ B1: 8.0/10 - Cargo.lock found
    ~ B2: 4.0/8 - Dockerfile found
    ✓ B3: 5.0/7 - Makefile found, standard build config found
  ✅ C. Transparency & Openness: 18.0/20 (90.0%)
    ✓ C1: 5.0/5 - LICENSE exists, OSI-approved license
    ✓ C2: 8.0/8 - comprehensive README, API documentation
    ✓ C3: 5.0/7 - CONTRIBUTING guide exists
  ⚠️ D. Statistical Rigor: 9.0/15 (60.0%)
    ~ D1: 3.0/5 - sample documentation found
    ~ D2: 3.0/5 - confidence intervals found
    ~ D3: 3.0/5 - comparison baselines found
  ✅ E. Historical Integrity: 8.5/10 (85.0%)
    ✓ E1: 4.0/4 - git repository
    ✓ E2: 2.5/3 - roadmap documented
    ✓ E3: 2.0/3 - CHANGELOG exists, semantic versioning
  ⚪ F. ML/AI Reproducibility: N/A

📋  Verdict
  GOOD: Project demonstrates solid scientific practices with room for improvement
  in statistical rigor and reproducibility documentation.

💡  Recommendations
  🟡 [Statistical Rigor] Add Criterion benchmarks with confidence intervals
     $ cargo add criterion --dev
  🟡 [Reproducibility] Add Nix flake for reproducible builds
     $ nix flake init
  🟢 [Transparency] Add ADR (Architecture Decision Records)
     $ mkdir docs/adr && echo "# ADR-001" > docs/adr/001-initial.md

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Gateway Failed Example

$ pmat popper-score

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🔬  Popper Falsifiability Score v1.1.0
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

❌  Gateway: FAILED (Falsifiability < 60%)
    Without falsifiable claims, score is 0.

📌  Summary
  Score: 0.0/95
  Normalized: 0.0%
  Grade: GATEWAY FAILED

📂  Categories
  ❌ A. Falsifiability & Testability: 10.0/25 (40.0%) [GATEWAY]
  ⚠️ B. Reproducibility Infrastructure: 15.0/25 (60.0%)
  ✅ C. Transparency & Openness: 18.0/20 (90.0%)
  ⚠️ D. Statistical Rigor: 8.0/15 (53.3%)
  ✅ E. Historical Integrity: 8.0/10 (80.0%)
  ⚪ F. ML/AI Reproducibility: N/A

📋  Verdict
  GATEWAY FAILED: Project does not meet minimum falsifiability requirements.
  Without testable claims, independent verification is not possible.

💡  Recommendations
  🔴 [Falsifiability (Gateway)] Add explicit falsifiable claims and test coverage.
     This is required for any score above 0.
     $ pmat quality-gate --checks tests,coverage

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Programmatic Usage

use pmat::services::popper_score::{score_project, PopperScore};
use std::path::Path;

fn main() -> anyhow::Result<()> {
    let score: PopperScore = score_project(Path::new("."))?;

    println!("Gateway Passed: {}", score.gateway_passed);
    println!("Normalized Score: {:.1}%", score.normalized_score);
    println!("Grade: {}", score.grade);

    // Access individual categories
    let falsifiability = &score.categories.falsifiability;
    println!(
        "Falsifiability: {:.1}/{:.0} ({:.1}%)",
        falsifiability.earned,
        falsifiability.max,
        falsifiability.percentage()
    );

    Ok(())
}

Run the example:

cargo run --example popper_score_demo
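
The same API can also enforce the gateway inside your own test suite. The test below is a sketch that relies only on the fields shown above (gateway_passed and normalized_score):

use pmat::services::popper_score::score_project;
use std::path::Path;

#[test]
fn falsifiability_gateway_holds() {
    let score = score_project(Path::new(".")).expect("scoring failed");
    assert!(score.gateway_passed, "Category A fell below 60%");
    assert!(
        score.normalized_score >= 70.0,
        "normalized score {:.1}% is below the B threshold",
        score.normalized_score
    );
}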

CI/CD Integration

GitHub Actions

name: Popper Falsifiability Score

on:
  push:
    branches: [ main ]
  pull_request:

jobs:
  popper-score:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Install PMAT
        run: cargo install pmat

      - name: Run Popper Score
        run: pmat popper-score --format json > popper-score.json

      - name: Upload Score
        uses: actions/upload-artifact@v3
        with:
          name: popper-score
          path: popper-score.json

      - name: Enforce Gateway
        run: |
          GATEWAY=$(jq '.gateway_passed' popper-score.json)
          if [ "$GATEWAY" != "true" ]; then
            echo "❌ Falsifiability Gateway FAILED"
            echo "Add explicit falsifiable claims to README"
            exit 1
          fi

      - name: Enforce Minimum Score
        run: |
          SCORE=$(jq '.normalized_score' popper-score.json)
          if (( $(echo "$SCORE < 70" | bc -l) )); then
            echo "❌ Score $SCORE below B threshold (70%)"
            exit 1
          fi

Pre-commit Hook

# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: popper-score
        name: Popper Falsifiability Check
        entry: bash -c 'pmat popper-score --format json | jq -e ".gateway_passed"'
        language: system
        pass_filenames: false
        always_run: true

Improving Your Score

Quick Wins

  1. Add falsifiable claims to README:

    ## Performance
    
    - Response time: < 50ms (99th percentile, measured via Criterion)
    - Memory: < 100MB for 1M records (tested in CI)
    - Throughput: > 10K requests/second (benchmark: wrk)
    
  2. Add test coverage reporting:

    cargo llvm-cov --html
    
  3. Commit Cargo.lock:

    git add Cargo.lock && git commit -m "Add lock file for reproducibility"
    
  4. Add Dockerfile:

    # Pin the toolchain image so builds stay reproducible
    FROM rust:1.75-slim
    WORKDIR /app
    COPY . .
    RUN cargo build --release
    

Medium-Term Improvements

  1. Add Criterion benchmarks with confidence intervals (see the sketch after this list)
  2. Document data sources and sample sizes
  3. Add ADRs (Architecture Decision Records)
  4. Create CHANGELOG with semantic versioning
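
For item 1, a minimal Criterion benchmark looks like the sketch below. Criterion reports a confidence interval for the mean execution time by default, which is the kind of quantified uncertainty sub-score A3 looks for; parse_input is a stand-in for your own code:

use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

fn parse_input(s: &str) -> usize {
    s.len() // placeholder for the code under test
}

fn bench_parse(c: &mut Criterion) {
    let input = "x".repeat(1024);
    c.bench_function("parse 1KB input", |b| {
        b.iter(|| parse_input(black_box(&input)))
    });
}

criterion_group!(benches, bench_parse);
criterion_main!(benches);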

For ML Projects

  1. Document random seeds:

    import numpy as np
    import torch

    SEED = 42
    torch.manual_seed(SEED)
    np.random.seed(SEED)
    
  2. Version control model artifacts:

    dvc init
    dvc add models/
    
  3. Document training data sources

Philosophical Foundation

The Popper Falsifiability Score is grounded in:

Karl Popper’s Falsificationism

From The Logic of Scientific Discovery (1959):

  • Science progresses through bold conjectures and rigorous refutation attempts
  • Unfalsifiable claims are not scientific
  • Reproducibility is essential for independent verification

Toyota Way Principles

| Principle | Application |
|-----------|-------------|
| Genchi Genbutsu | Evidence-based scoring (go see for yourself) |
| Jidoka | Gateway stops the line on unfalsifiable claims |
| Kaizen | Continuous improvement through recommendations |
| Hansei | Honest reflection on current state |

Related Commands

  • pmat repo-score - General repository health (language-agnostic)
  • pmat rust-project-score - Rust-specific quality metrics
  • pmat quality-gate - Enforce quality thresholds
  • pmat five-whys - Root cause analysis

Summary

The pmat popper-score command provides:

  • 100-point scoring across 6 categories
  • Falsifiability Gateway (Category A < 60% = 0 total score)
  • N/A handling for ML category in non-ML projects
  • Multiple output formats (text, JSON, markdown, YAML)
  • Actionable recommendations with priority ranking

Key Differentiators:

  • Scientific rigor focus (vs. general code quality)
  • Gateway mechanism (Popper’s demarcation criterion)
  • Evidence-based claims verification
  • Reproducibility infrastructure checks

Run your first score:

pmat popper-score --verbose