14. Batuta Falsification Checklist
14.1 108-Item Popperian Assessment
The Albor project itself is subject to batuta’s 108-item falsification checklist:
# Full assessment
batuta falsify . --verbose --format markdown --output docs/falsification-report.md
# Critical-only (blocks release)
batuta falsify . --critical-only
# CI-friendly output
batuta falsify . --format github-actions --min-grade kaizen-required
14.2 Key Sections Applied to Albor
Section 1: Sovereign Data Governance (SDG)
- All training data has documented provenance (HuggingFace commit SHAs)
- No PII in training corpus (alimentar quality check)
- Data residency: all data stored on owned hardware (lambda + intel)
- Teacher model license verified (Apache 2.0)
Section 3: Hypothesis-Driven Development (HDD)
- Each improvement stage has a falsifiable hypothesis:
- “Distillation improves avg benchmark by >5%” (FALSIFY-ALBOR-005)
- “Pruning at 50% sparsity degrades benchmarks by <2%” (FALSIFY-ALBOR-008)
- “Q4 quantization degrades perplexity by <5%” (FALSIFY-ALBOR-009)
- Reproducibility standard: Gold (deterministic seeds, versioned data, BLAKE3 checkpoint hashes, Cargo.lock pinning)
Section 4: Numerical Reproducibility (NR)
- Float determinism enforced via fixed seeds and operator ordering
- Cross-platform consistency: checkpoint trained on lambda loads on intel
- SIMD parity: all kernels have provable-contracts SIMD equivalence obligations
Section 5: Performance & Waste Elimination (PW)
- Seven Wastes (Muda) applied to training pipeline:
- No redundant data copies (alimentar streaming)
- No idle GPU time (pre-computed teacher logits)
- No over-processing (progressive model sizing: 50M → 125M → 350M)
Section 6: Safety & Formal Verification (SF)
- Critical kernels have Kani proofs (softmax, attention, cross-entropy)
- New kernels (KD loss, gradient accumulation) get Kani harnesses
Section 10: Architectural Invariants (AI) — CRITICAL
- AI-01: All model operations use apr (no manual weight manipulation)
- AI-02: Every checkpoint is BLAKE3-hashed and version-tracked
- AI-03: Training config is immutable once committed (no runtime overrides)
- AI-04: Eval results are reproducible (fixed seed, deterministic batching)
- AI-05: No undeclared dependencies (Cargo.lock enforced)
14.3 Current Grade
Perfect Score: 100.0% (108/108 PASS) — achieved 2026-03-04.
This exceeds the Toyota Standard (90-100%) target:
- All 5 Critical items pass (Section 10)
- All Major items pass
- All Minor items pass
- Zero PARTIAL, zero FAIL
Score progression across 14 MLOps survey batches: 34% → 100%
(see entrenar/docs/specifications/world-class-mlops-survey.md).