Sovereign Tooling Map: World-Class or Wire It In

Every leaderboard-winning technique maps to a sovereign stack component. When a component doesn't support a technique at world-class level, we don't skip it — we find or build the capability and wire it into apr CLI commands.

5.1 Tooling Coverage Matrix

| Technique | Required Capability | Sovereign Component | Status | Gap Action |
|---|---|---|---|---|
| Import HF models | SafeTensors/GGUF → .apr | aprender 0.4.11 | ✅ Complete | apr import — 14+ architectures supported |
| Inference (decode) | Transformer forward pass | realizar 0.8 | ✅ Complete | apr run — 8-21% faster than llama.cpp |
| Inference (serve) | HTTP API, batching, streaming | realizar 0.8 | ✅ Complete | apr serve — OpenAI-compatible, PagedAttention |
| LoRA/QLoRA training | Low-rank adaptation, autograd | entrenar 0.7 | ✅ Complete | apr finetune — AdamW, cosine LR, checkpointing |
| Checkpoint management | Atomic save, resume, NaN scan, filtered load | aprender 0.4.11 | ✅ Complete | AprWriter::write() atomic (F-CKPT-009), AprReader::open_filtered() (F-CKPT-016), read_tensor_f32_checked() (F-CKPT-013), validate_tensor_shape() (F-CKPT-014) — 18/18 contracts |
| Knowledge distillation | KL-divergence, progressive, text-based | entrenar 0.7 | ✅ Complete | apr distill — standard, progressive, ensemble, text-based (GH-455) |
| Model merging | SLERP, TIES, DARE | aprender 0.4.11 | ✅ Complete | apr merge — 5 strategies |
| Pruning | Wanda, SparseGPT, structured | aprender 0.4.11 | ✅ Complete | apr prune — 6 methods |
| Quantization | INT4, INT8, Q4K, Q6K | aprender 0.4.11 | ✅ Complete | apr quantize — 4 formats |
| SIMD tensor ops | AVX2, AVX-512, NEON matmul | trueno 0.16.3 | ✅ Complete | 6% faster than NumPy at 256×256 |
| GPU compute | wgpu (Vulkan/Metal/DX12), CUDA PTX JIT | trueno 0.16.3 + trueno-gpu 0.4.35 | ✅ Complete | Pure Rust, any GPU vendor. wgpu cosine=0.999863 on Blackwell. See §25. |
| Speculative decoding | Draft model + verification | realizar 0.8 | ⚠️ Planned | GH-10: apr run --speculative not yet implemented |
| KV cache management | PagedAttention, CoW | realizar 0.8 | ✅ Complete | vLLM-style paged KV |
| Data loading | Parquet, JSONL, Arrow, HF Hub | alimentar 0.2 | ✅ Complete | Zero-copy Arrow RecordBatches |
| Data quality | Null/outlier/drift detection | alimentar 0.2 | ✅ Complete | 100-point quality scoring |
| Data decontamination | N-gram overlap detection | alimentar 0.2 | Wired | apr data decontaminate — n-gram overlap vs benchmarks (alimentar#30, aprender#415) |
| HPO | TPE, Hyperband, ASHA | entrenar 0.7 | ✅ Complete | apr tune --strategy tpe |
| Compile to binary | Model + runtime → executable | aprender 0.4.11 | ✅ Complete | apr compile |
| Correctness proofs | Kani bounded model checking | provable-contracts | ✅ Complete | 262 proof obligations |
| Quality gates | Compliance enforcement | pmat | ✅ Complete | 30+ automated checks |
| DPO/ORPO alignment | Preference optimization | entrenar 0.7 | Wired | make align → apr finetune --method dpo (GH-8: dedicated apr align planned) |
| Execution sandbox | Run generated code safely | — | Missing | External harness (see §5.3) |
| N-sampling + rerank | Batched generation, voting | aprender 0.27 | ⚠️ Partial | N-sampling via NUM_SAMPLES in eval script; --temperature + --top-k wired through batch mode. Reranking not yet implemented. |
| Prompt templates | SCoT, few-shot strategies | eval script | Working | 5 strategies in build_instruction(): standard, scot, few-shot, cgo, default. Few-shot best for HumanEval (+1.83pp). MBPP test assertions = +25.4pp. |
| Synthetic data gen | Teacher → training corpus | alimentar 0.2 + aprender | ⚠️ Partial | Generation via apr chat --batch; curation pipeline needed |
| Continued pretraining | Full-weight code corpus training | entrenar 0.7 | ⚠️ Partial | Full finetune works; needs large-corpus streaming |
| Flash Attention | Online softmax, tiled attention | trueno 0.16 | 🔧 In Progress | Phase 12 planned; tiling infra ready (wgpu compute shaders) |

5.2 Gap 1: DPO/ORPO Preference Optimization (CRITICAL)

Why world-class: DPO is the single most impactful post-training technique for leaderboards. Merged + DPO models "completely dominate" HF leaderboard rankings. Without DPO, we compete with one hand tied.

Current state: make align routes through apr finetune --method dpo which connects to entrenar's loss functions. A dedicated apr align subcommand is planned (GH-8).

Current implementation:

# DPO alignment via make align (routes through apr finetune)
make align CHECKPOINT=model.apr PREFS_DATA=prefs.jsonl ALIGN_METHOD=dpo

# Equivalent direct command
apr finetune model.apr --method dpo --data prefs.jsonl \
    --output aligned.apr --verbose

Remaining wire-in plan:

Component: entrenar
  Add: src/dpo/mod.rs — DPO loss (β-scaled log-ratio of policy vs reference)
  Add: src/dpo/data.rs — preference pair loader (chosen/rejected format)
  Add: src/dpo/orpo.rs — ORPO variant (no reference model needed)

Component: alimentar
  Add: Preference pair generation from execution feedback
    alimentar generate-preferences \
      --model model.apr \
      --problems humaneval.jsonl \
      --n-samples 10 \
      --judge execution \
      -o preference-pairs.jsonl

Component: Ground truth corpus
  Use: hf-ground-truth-corpus, algorithm-competition-corpus
    → Source of verified correct/incorrect code pairs for DPO training

Acceptance criterion: apr align --method dpo produces a model with ≥2% higher HumanEval+ than the input model after 3 epochs.
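The β-scaled log-ratio loss named in the plan above can be sketched as follows. This is an illustrative sketch only, not entrenar's actual API; the function name and signature are hypothetical, and inputs are summed per-token log-probabilities of each completion under the policy and frozen reference models.

```rust
/// Hypothetical DPO loss sketch (not entrenar's real API).
/// Assumes log-probs are already summed over the completion tokens.
fn dpo_loss(
    policy_chosen_logp: f64,
    policy_rejected_logp: f64,
    ref_chosen_logp: f64,
    ref_rejected_logp: f64,
    beta: f64, // KL-penalty strength, typically 0.1-0.5
) -> f64 {
    // β-scaled log-ratio of policy vs reference, chosen minus rejected
    let logits = beta
        * ((policy_chosen_logp - ref_chosen_logp)
            - (policy_rejected_logp - ref_rejected_logp));
    // -log σ(logits): minimized when the policy prefers chosen over rejected
    -(1.0 / (1.0 + (-logits).exp())).ln()
}
```

At initialization (policy = reference) the logits are zero and the loss is ln 2; it falls as the policy's margin for the chosen completion grows. ORPO drops the two reference terms, which is why L56's variant needs no reference model.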

5.3 Gap 2: Code Execution Sandbox (CRITICAL)

Why world-class: HumanEval and MBPP require executing generated code against test cases. Without execution, we can't compute pass@k — we can only measure perplexity, which doesn't correlate well with code correctness.

Current state: aprender has no sandboxed code execution. Generated completions must be evaluated externally.

Wire-in plan (two options):

Option A: External EvalPlus harness (short-term, pragmatic)
  apr eval model.apr --data humaneval.jsonl --n-samples 10 \
    --output-completions completions/ --json
  # Then externally: evalplus.evaluate --samples completions/
  # This is what everyone does — even Google and Meta use external harnesses

Option B: WASM sandbox (long-term, sovereign)
  Component: realizar or new crate
  Add: Embedded WASM runtime (wasmtime) for safe code execution
    apr eval model.apr --data humaneval.jsonl \
      --sandbox wasm --timeout 10s --json
  Advantage: Fully sovereign, no Python dependency even for eval
  Risk: Python test cases require Python-in-WASM (CPython compiled to WASM)

Decision: Option A for v1.0 (get on the leaderboard), Option B as stretch goal. Neither compromises the "zero Python" claim for the model pipeline — eval is a separate concern.
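Either option exists to feed the standard unbiased pass@k estimator: with n samples generated per problem and c of them passing, pass@k = 1 − C(n−c, k)/C(n, k). A minimal sketch (illustrative, not part of the stack):

```rust
/// Unbiased pass@k: 1 - C(n-c, k) / C(n, k).
/// n = samples generated, c = samples that passed (assumes c <= n <= inputs valid).
fn pass_at_k(n: u64, c: u64, k: u64) -> f64 {
    if n - c < k {
        return 1.0; // every size-k subset contains at least one passing sample
    }
    // C(n-c,k)/C(n,k) as a numerically stable running product
    let mut prob_all_fail = 1.0f64;
    for i in 0..k {
        prob_all_fail *= (n - c - i) as f64 / (n - i) as f64;
    }
    1.0 - prob_all_fail
}
```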

5.4 Gap 3: N-Sampling + Reranking Pipeline

Why world-class: Generating N=10-50 completions and selecting the best one boosts effective pass@1 by 10-30%. This is the single most impactful inference-time technique.

Current state: aprender can generate multiple completions via temperature sampling. Missing: batched generation, reranking logic, majority voting.

Wire-in plan:

Component: aprender (apr-cli)
  Extend: `apr eval --n-samples N --rerank strategy`
    Strategies: logprob (sum of log-probabilities), majority (output voting),
                execution (run and pick passing code — requires sandbox)

Component: realizar
  Already supports: batched generation, concurrent requests
  Need: expose batch generation for N completions per prompt efficiently

Component: alimentar
  Add: Result aggregation and voting logic for N-sample outputs
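The two sandbox-free strategies from the plan can be sketched as below. Function names and input shapes are hypothetical, not aprender's or alimentar's actual API; each completion is paired with its summed log-probability.

```rust
use std::collections::HashMap;

/// logprob strategy: pick the completion with the highest total log-prob.
/// (Hypothetical helper, illustrative only.)
fn rerank_logprob(samples: &[(String, f64)]) -> Option<&str> {
    samples
        .iter()
        .max_by(|a, b| a.1.total_cmp(&b.1)) // highest summed log-prob wins
        .map(|(text, _)| text.as_str())
}

/// majority strategy: most frequent (normalized) output wins.
fn rerank_majority(outputs: &[String]) -> Option<&str> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for o in outputs {
        *counts.entry(o.as_str()).or_insert(0) += 1;
    }
    counts.into_iter().max_by_key(|&(_, n)| n).map(|(o, _)| o)
}
```

The execution strategy replaces both with "run the tests, keep a passer", which is why it depends on the §5.3 sandbox.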

5.5 Gap 4: Synthetic Training Data Pipeline

Why world-class: Qwen2.5-Coder, Phi-4, and NVIDIA OCR-Nemotron all credit large-scale synthetic data as core to their success. Without high-quality synthetic training data, fine-tuning is limited to existing datasets.

Current state: apr chat --batch can generate completions. alimentar handles data loading and quality scoring. Ground-truth corpora exist (hf-ground-truth-corpus, algorithm-competition-corpus). Missing: end-to-end curation pipeline.

Wire-in plan:

Component: alimentar
  CLI pipeline:
    # 1. Generate raw synthetic code from teacher
    apr chat teacher.apr --batch problems.txt --n-samples 5 \
      --temperature 0.8 --json > raw-synthetic.jsonl

    # 2. Quality-filter with alimentar
    alimentar quality raw-synthetic.jsonl --min-score 80 \
      -o filtered-synthetic.jsonl

    # 3. Decontaminate against eval benchmarks
    alimentar drift filtered-synthetic.jsonl \
      --reference humaneval.jsonl mbpp.jsonl \
      --overlap-threshold 0.01 \
      -o clean-synthetic.jsonl

    # 4. Balance and split
    alimentar convert clean-synthetic.jsonl \
      -o training-data.parquet

Component: Ground truth corpora
  hf-ground-truth-corpus → HuggingFace API patterns, transformer implementations
  algorithm-competition-corpus → Algorithm problems with verified solutions
  → Both feed into fine-tuning data mix
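The overlap check behind step 3 can be sketched as the fraction of a sample's word n-grams that also occur in a benchmark; samples above the threshold (e.g. 0.01) are dropped. This is an illustrative sketch, not alimentar's actual implementation.

```rust
use std::collections::HashSet;

/// Fraction of `sample`'s word n-grams that also appear in `benchmark`.
/// Assumes n >= 1. Illustrative only.
fn ngram_overlap(sample: &str, benchmark: &str, n: usize) -> f64 {
    fn grams<'a>(text: &'a str, n: usize) -> HashSet<Vec<&'a str>> {
        let words: Vec<&str> = text.split_whitespace().collect();
        // windows(n) yields nothing when the text has fewer than n words
        words.windows(n).map(|w| w.to_vec()).collect()
    }
    let sample_grams = grams(sample, n);
    if sample_grams.is_empty() {
        return 0.0;
    }
    let bench_grams = grams(benchmark, n);
    let hits = sample_grams.intersection(&bench_grams).count();
    hits as f64 / sample_grams.len() as f64
}
```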

5.6 Gap 5: Prompt Strategy Engine

Why world-class: SCoT prompting improves HumanEval pass@1 by up to 13.79%. Few-shot exemplars add 3-8%. The prompt template matters as much as the model weights.

Current state: PROMPT_STRATEGY is implemented in scripts/eval-pass-at-k.sh with 4 built-in strategies. The upstream apr run --chat provides raw chat template support.

Implemented in eval pipeline:

# All 5 strategies work via Makefile targets (best: few-shot 87.20%):
make eval-humaneval CHECKPOINT=m.apr PROMPT_STRATEGY=standard
make eval-humaneval CHECKPOINT=m.apr PROMPT_STRATEGY=scot
make eval-humaneval CHECKPOINT=m.apr PROMPT_STRATEGY=few-shot
make eval-humaneval CHECKPOINT=m.apr PROMPT_STRATEGY=cgo

Built-in strategies (with aliases):

| Strategy | Aliases | Description |
|---|---|---|
| standard | default | Raw problem → code (baseline) |
| scot | structured-cot | Structured chain-of-thought → code (+5-14%) |
| few-shot | fewshot | N exemplars + problem → code (+3-8%) |
| cgo | code-gen-opt | Chain of grounded objectives → code (+5-10%) |
| reflexion | reflect | Generate → test → reflect → regenerate (multi-turn) |

Remaining wire-in for upstream apr:

Component: realizar
  Already supports: chat templates (ChatML, LLaMA2, Mistral, Phi, Alpaca)
  Need: expose template composition for eval pipeline
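The alias table above implies a small dispatcher. A hedged sketch of build_instruction()'s shape — the prompt texts are placeholders, and this is not the eval script's actual code:

```rust
/// Alias-aware strategy dispatch mirroring the strategy table.
/// Prompt wordings are illustrative placeholders.
fn build_instruction(strategy: &str, problem: &str) -> String {
    match strategy {
        "scot" | "structured-cot" => format!(
            "First outline the solution as structured steps (sequence, \
             branch, loop), then write the code.\n\n{problem}"
        ),
        "few-shot" | "fewshot" => format!(
            "{exemplars}\n\n{problem}",
            exemplars = "# Exemplar 1: ...\n# Exemplar 2: ..." // placeholders
        ),
        "cgo" | "code-gen-opt" => format!(
            "State the grounded objectives, then write code meeting each.\n\n{problem}"
        ),
        // "standard"/"default" and unknown strategies fall back to the raw problem
        _ => problem.to_string(),
    }
}
```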

5.7 Sovereign Stack Version Requirements

All gap closures must use published crates from crates.io. No git dependencies.

| Crate | Current | Required For Gaps | Minimum Version |
|---|---|---|---|
| aprender | 0.27.2 | apr align, --n-samples --rerank, checkpoint contracts (18/18 done in 0.27.2) | 0.28 |
| entrenar | 0.7.5 | DPO loss, preference pair loader, ORPO | 0.8 |
| trueno | 0.16.1 | Flash attention (Phase 12) | 0.17 |
| realizar | 0.8.0 | Batch N-sampling, prompt template composition | 0.9 |
| alimentar | 0.2.6 | Decontamination pipeline, preference pair generation, quality filtering | 0.3 |
| provable-contracts | 0.1 | DPO kernel contracts | 0.2 |

5.8 The Decision Rule

When we find a gap:

  1. Can an existing sovereign crate do it? → Wire it in via apr CLI. No new crates.
  2. Does a sovereign crate need a new module? → Add it to that crate, publish to crates.io, bump apr-leaderboard's dependency.
  3. Is it fundamentally outside the stack's scope? → Use an external tool (e.g., EvalPlus for code execution) and document the boundary explicitly.
  4. Is it a research problem with no clear solution? → Add to §21 Open Questions. Don't block the pipeline.

Hard rule: We never add a Python dependency. We never add a C/C++ FFI dependency. GPU compute is wgpu (primary, any vendor, pure Rust) with optional CUDA backend for hardware where wgpu support lags (e.g., Blackwell sm_121). No GPU vendor lock-in. If the sovereign stack can't do it in pure Rust, we either build it or scope it out with an explicit boundary.

5.9 Parity Check: Ludwig Feature Coverage

Ludwig (ludwig.ai) is the state-of-the-art declarative ML framework. Every feature Ludwig ships, the sovereign stack must match or exceed — in pure Rust, with zero Python. This is the parity bar.

5.9.1 Feature-by-Feature Parity Matrix

Training & Fine-tuning:

| Ludwig Feature | Ludwig Implementation | Sovereign Stack | Status |
|---|---|---|---|
| Full fine-tuning | PyTorch, trainable=true | entrenar apr finetune --method full | ✅ Parity |
| LoRA adapters | PEFT library, configurable rank/dropout/targets | entrenar apr finetune --method lora | ✅ Parity |
| QLoRA (4-bit base + LoRA) | bitsandbytes + PEFT | entrenar apr finetune --method qlora | ✅ Parity |
| AdaLoRA (dynamic rank allocation) | PEFT AdaLoRA | entrenar — not yet | Gap |
| IA3 (inhibiting/amplifying activations) | PEFT IA3 | entrenar — not yet | Gap |
| DoRA (weight-decomposed LoRA) | PEFT DoRA variant | entrenar — not yet | Gap |
| NEFTune (embedding noise) | noise injection during fine-tune | entrenar — not yet | Gap |
| Gradient accumulation | PyTorch native | entrenar gradient accumulation | ✅ Parity |
| Mixed precision (fp16/bf16) | PyTorch AMP | entrenar GradScaler, bf16/fp16 | ✅ Parity |
| Early stopping | callback-based | entrenar EarlyStopping callback | ✅ Parity |
| Checkpointing | periodic save, atomic write, resume | aprender AprWriter::write() (atomic) + entrenar CheckpointCallback | Exceeds (18 contracts: atomic writes, NaN scan, filtered load, round-trip determinism, provenance) |
| Learning rate warmup + cosine decay | scheduler | entrenar WarmupCosineDecayLR | ✅ Parity |

Optimizers:

| Ludwig Feature | Ludwig Implementation | Sovereign Stack | Status |
|---|---|---|---|
| AdamW | PyTorch AdamW | entrenar AdamW (SIMD-accelerated) | ✅ Exceeds |
| Adam | PyTorch Adam | entrenar Adam | ✅ Parity |
| SGD with momentum | PyTorch SGD | entrenar SGD with momentum | ✅ Parity |
| 8-bit optimizers | bitsandbytes 8-bit Adam | — not yet | Gap |
| Paged optimizers | bitsandbytes paged | — not yet | Gap |

Distributed Training:

| Ludwig Feature | Ludwig Implementation | Sovereign Stack | Status |
|---|---|---|---|
| Multi-GPU DDP | PyTorch DDP via Ray | — not yet (single-GPU via wgpu) | Gap |
| DeepSpeed ZeRO | Microsoft DeepSpeed | — not yet | Gap |
| Multi-node training | Ray cluster | entrenar GPU-SHARE Phase 3 (SSH cluster, job placement) | Exceeds (heterogeneous: 4090 + Jetson + CPU nodes) |
| Automatic batch size selection | binary search on GPU OOM | aprender --vram planning + entrenar VRAM guard | ✅ Parity |
| GPU sharing (multi-adapter) | not supported | entrenar GPU-SHARE (multi-adapter single-process, 3x VRAM savings) | Exceeds |

Quantization:

| Ludwig Feature | Ludwig Implementation | Sovereign Stack | Status |
|---|---|---|---|
| 4-bit quantization (nf4/fp4) | bitsandbytes | aprender INT4, Q4K | ✅ Parity |
| 8-bit quantization | bitsandbytes | aprender INT8, Q8_0 | ✅ Parity |
| Double quantization | bitsandbytes nested | — not yet | ⚠️ Partial |
| GPTQ | auto-gptq | — not yet | Gap |
| AWQ | autoawq | — not yet | Gap |

Inference & Generation:

| Ludwig Feature | Ludwig Implementation | Sovereign Stack | Status |
|---|---|---|---|
| Greedy decoding | HF generate | realizar greedy | ✅ Parity |
| Temperature sampling | HF generate | realizar temperature | ✅ Parity |
| Top-k sampling | HF generate | realizar top-k | ✅ Parity |
| Nucleus (top-p) sampling | HF generate | realizar top-p | ✅ Parity |
| Beam search | HF generate | aprender num_beams | ✅ Parity |
| Contrastive search | HF generate | — not yet | Gap |
| Diverse beam search | HF generate | — not yet | Gap |
| Repetition penalty | HF generate | aprender repetition_penalty | ✅ Parity |
| Speculative decoding | not supported | realizar speculative | Exceeds |
| Streaming generation | not documented | realizar SSE streaming | Exceeds |
| OpenAI-compatible API | not supported | realizar /v1/chat/completions | Exceeds |
| PagedAttention KV cache | not supported | realizar paged KV | Exceeds |
| Continuous batching | not supported | realizar batch scheduling | Exceeds |

Serving & Deployment:

| Ludwig Feature | Ludwig Implementation | Sovereign Stack | Status |
|---|---|---|---|
| REST API serving | ludwig serve (Flask) | realizar apr serve (Axum) | ✅ Parity |
| Docker containers | prebuilt images | — user-provided | ⚠️ Partial |
| TorchScript export | PyTorch jit.trace | — not applicable (native binary) | N/A |
| Triton Inference Server | export format | — not applicable | N/A |
| HuggingFace Hub upload | ludwig upload | aprender apr publish | ✅ Parity |
| Compile to standalone binary | not supported | aprender apr compile | Exceeds |
| ONNX/CoreML/OpenVINO export | not supported | aprender apr export | Exceeds |

Data Processing:

| Ludwig Feature | Ludwig Implementation | Sovereign Stack | Status |
|---|---|---|---|
| CSV/JSON/Parquet/HDF5 loading | pandas | alimentar Arrow-native | ✅ Exceeds (zero-copy) |
| Auto preprocessing per feature type | Ludwig preprocessors | alimentar transforms | ✅ Parity |
| Train/val/test splitting | Ludwig split | alimentar DatasetSplit (stratified) | ✅ Parity |
| Larger-than-memory datasets | Ray datasets | alimentar MmapDataset, streaming | ✅ Parity |
| Data quality scoring | not built-in | alimentar 100-point quality scoring | Exceeds |
| Drift detection | not built-in | alimentar KS/Chi-sq/PSI/JSD | Exceeds |
| Imbalance detection + resampling | not built-in | alimentar SMOTE, oversample | Exceeds |

Hyperparameter Optimization:

| Ludwig Feature | Ludwig Implementation | Sovereign Stack | Status |
|---|---|---|---|
| Random search | Ray Tune | entrenar RandomSearch | ✅ Parity |
| Grid search | Ray Tune | entrenar GridSearch | ✅ Parity |
| Bayesian (TPE) | Ray Tune Optuna | entrenar TPEOptimizer | ✅ Parity |
| ASHA scheduler | Ray Tune ASHA | entrenar HyperbandScheduler | ✅ Parity |
| Distributed HPO | Ray cluster | — not yet (local only) | Gap |

Model Architecture:

| Ludwig Feature | Ludwig Implementation | Sovereign Stack | Status |
|---|---|---|---|
| ECD (Encoder-Combiner-Decoder) | Ludwig native | — different architecture | N/A (not needed) |
| GBM (LightGBM) | LightGBM wrapper | — not in scope | N/A |
| LLM causal models | HF Transformers | aprender + realizar | ✅ Parity |
| Multi-modal (text+image+audio) | ECD combiner | — LLM-only for leaderboard | N/A (future) |
| Multi-task learning | multiple output heads | — not yet | ⚠️ Partial |
| Custom PyTorch modules | register API | — Rust modules via entrenar | ✅ Parity |

Experiment Tracking:

| Ludwig Feature | Ludwig Implementation | Sovereign Stack | Status |
|---|---|---|---|
| TensorBoard | callback | — not yet | Gap |
| Weights & Biases | callback | — not yet | Gap |
| MLflow | callback | — not yet | Gap |
| Comet ML | callback | — not yet | Gap |
| Built-in TUI monitoring | not supported | entrenar monitor + TUI | Exceeds |
| Prometheus metrics | not supported | realizar /metrics | Exceeds |

Explainability & Visualization:

| Ludwig Feature | Ludwig Implementation | Sovereign Stack | Status |
|---|---|---|---|
| Feature importance | built-in | entrenar ExplainabilityCallback | ✅ Parity |
| Learning curves | matplotlib | entrenar MonitorCallback | ⚠️ Partial |
| Confusion matrices | built-in | entrenar eval metrics | ⚠️ Partial |
| Model architecture visualization | built-in | aprender apr tree, apr flow | ✅ Parity |

Correctness & Quality (sovereign stack advantages):

| Feature | Ludwig | Sovereign Stack | Advantage |
|---|---|---|---|
| Provable kernel correctness | none | provable-contracts Kani L4 | Unique |
| 262 proof obligations | none | provable-contracts | Unique |
| Compliance enforcement | none | pmat comply 30+ checks | Unique |
| Deterministic builds | pip/conda chaos | Cargo.lock | Unique |
| wgpu GPU compute (any vendor) | requires CUDA toolkit | trueno wgpu (Vulkan/Metal/DX12) | Unique |
| Format-agnostic conversion | not supported | aprender apr rosetta | Unique |
| Model diff/forensics | not supported | aprender apr diff, apr hex | Unique |
| 10-stage integrity check | not supported | aprender apr check | Unique |

5.9.2 Summary: Where We Exceed, Where We Must Close Gaps

We have parity in 24+ areas: LoRA, QLoRA, full fine-tuning, AdamW/Adam/SGD, gradient accumulation, mixed precision, early stopping, LR scheduling, all sampling strategies, beam search, REST serving, HF upload, data loading, preprocessing, train/val/test splits, HPO (grid/random/TPE/ASHA), feature importance.

We exceed Ludwig in 16+ areas (updated): speculative decoding, PagedAttention, continuous batching, streaming API, OpenAI-compatible serving, compile-to-binary, multi-format export (ONNX/CoreML/OpenVINO), data quality scoring, drift detection, imbalance detection, Prometheus metrics, TUI monitoring, provable contracts, deterministic builds, format forensics, checkpointing (18 verified contracts: atomic writes, NaN scan, filtered loading, round-trip determinism, provenance chain — vs Ludwig's basic callback).

Gaps to close (9 items):

| Gap | Priority | Wire-in Target |
|---|---|---|
| AdaLoRA (dynamic rank) | Medium | entrenar 0.8 |
| IA3 adapter | Low | entrenar 0.8 |
| DoRA (weight-decomposed LoRA) | Medium | entrenar 0.8 |
| NEFTune (embedding noise) | Low | entrenar 0.8 |
| 8-bit optimizers | Low | entrenar 0.8 |
| Contrastive search decoding | Low | aprender 0.28 |
| Diverse beam search | Low | aprender 0.28 |
| Multi-GPU DDP | High | entrenar 0.9 |
| GPTQ quantization | Medium | aprender 0.28 |

Recently closed gaps:

  • Multi-node training → GPU-SHARE Phase 3: SSH cluster config, job placement, checkpoint coordination (143 tests)
  • Automatic batch size selection → VRAM guard + ledger prevents OOM, --vram planning
  • Experiment tracking → entrenar TUI monitor + JSONL event logging + checkpoint metadata

Out of scope (not needed for leaderboard): ECD architecture, GBM/LightGBM, multi-modal (text+image+audio), Triton export, TorchScript. These serve Ludwig's "general ML framework" positioning. We are a purpose-built leaderboard pipeline, not a general framework.

5.10 GPU Compute Architecture: PTX JIT vs Pre-compiled Kernels

5.10.1 Why PTX JIT (Not nvcc)

PyTorch ships fat binaries — pre-compiled SASS (GPU machine code) for every supported architecture (sm_70, sm_80, sm_86, sm_89, sm_90). At runtime, the CUDA driver selects the matching SASS — zero JIT, instant startup. This requires nvcc (NVIDIA's proprietary compiler) and the CUDA toolkit (~2+ GB) at build time.

trueno-gpu takes a fundamentally different approach: PTX string templates embedded in Rust. PTX (Parallel Thread Execution) is NVIDIA's stable intermediate assembly language. trueno-gpu writes CUDA kernels directly as PTX strings in Rust source code, compiled into the apr binary by cargo build — no nvcc, no CUDA toolkit, no C/C++ FFI.

At runtime, the CUDA driver JIT-compiles PTX to device-specific SASS for whatever GPU is present. This is the same mechanism PyTorch uses as a fallback for unsupported architectures — trueno-gpu uses it as the primary path.

5.10.2 Trade-offs

| Aspect | PyTorch (pre-compiled SASS) | trueno-gpu (PTX JIT) |
|---|---|---|
| Build deps | nvcc + CUDA toolkit (2+ GB) | cargo build only |
| New GPU support | Requires new release with SASS | Automatic (PTX forward-compatible) |
| Startup time | Instant | 20-80s JIT (amortized by --batch-jsonl) |
| Binary size | ~500 MB (fat binaries) | ~10 MB (PTX strings) |
| Vendor lock-in | CUDA toolkit version | None (PTX is stable ISA) |
| Reproducibility | Tied to CUDA/cuDNN version | Same binary, any NVIDIA GPU |

5.10.3 Amortization via Batch Mode

The --batch-jsonl flag is the architectural answer to JIT overhead. For a 164-problem HumanEval eval:

  • Without batch: 80s JIT × 164 invocations = 3.6 hours of JIT alone
  • With batch: 80s JIT × 1 load = 80s total JIT, then pure inference

Amortized JIT cost per problem: <0.5s. The sovereignty benefit (zero external toolchain, forward GPU compatibility) far outweighs the one-time startup cost.

5.10.4 Blackwell sm_121 and the Try 1/Try 2 Pattern

On Blackwell (sm_121), the CUDA 13.0 driver has a JIT bug: it rejects PTX with .target sm_121 (error 300, CUDA_ERROR_INVALID_SOURCE). The GH-480 fix implements a defensive fallback:

  1. Try 1: Compile PTX with explicit .target sm_121 — fails (error 300)
  2. Try 2: Compile with cuModuleLoadData (no explicit target) — succeeds

This Try 1 → Try 2 pattern is a driver workaround, not a design choice. When NVIDIA fixes the sm_121 JIT in a future driver, Try 1 will succeed and the fallback becomes dead code. The PTX post-processor (GH-480) also patches backward bra LABEL instructions to @%p_jw bra LABEL for sm_121 compatibility.
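The control flow of the workaround can be sketched generically. The two closures stand in for the real CUDA driver calls (loading PTX with an explicit .target directive vs. plain cuModuleLoadData); everything else here is illustrative, not trueno-gpu's actual code.

```rust
/// Try 1 → Try 2 fallback sketch (GH-480). Error 300 is
/// CUDA_ERROR_INVALID_SOURCE, the code the buggy sm_121 JIT returns
/// for PTX carrying an explicit `.target sm_121`.
fn load_with_fallback<M>(
    try_with_target: impl FnOnce() -> Result<M, u32>,
    try_without_target: impl FnOnce() -> Result<M, u32>,
) -> Result<M, u32> {
    match try_with_target() {
        // Try 1 succeeded: healthy driver accepted the explicit target.
        Ok(m) => Ok(m),
        // sm_121 JIT bug: retry without the target directive and let
        // the driver infer the architecture (Try 2).
        Err(300) => try_without_target(),
        // Any other error is a genuine failure; propagate it unchanged.
        Err(e) => Err(e),
    }
}
```

Once NVIDIA fixes the JIT, Try 1 returns Ok and the Err(300) arm becomes dead code, exactly as the text describes.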

5.10.5 FP8 Architecture Guard (GH-542)

FP8 E4M3 GEMM kernels (Ada/Hopper-specific) cause CUDA_ERROR_ILLEGAL_ADDRESS on Blackwell, poisoning the CUDA context. Fix: detect_fp8_prefill() gates FP8 to cc >= 89 && cc < 100, which auto-disables it on Blackwell (cc >= 100). Provable contract: gpu-context-health-v1.yaml (3 proof obligations, 3 falsification tests).

Five-whys: (1) Why crash? FP8 warmup writes invalid memory on sm_121. (2) Why invalid? FP8 E4M3 cuBLASLt kernels are Ada/Hopper-specific. (3) Why enabled? cc >= 89 without upper bound. (4) Why no bound? Blackwell didn't exist when written. (5) Fix: cc < 100 guard in 3 files (commit a4bcd908).
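The guard itself is a one-line range check. A sketch, assuming compute capability encoded as major×10 + minor (sm_89 → 89, sm_121 → 121); the function name here is illustrative and detect_fp8_prefill()'s real signature may differ:

```rust
/// GH-542 guard sketch: FP8 E4M3 prefill is enabled only on Ada (8.9)
/// and Hopper (9.x). The upper bound excludes Blackwell (cc >= 100),
/// which did not exist when the original `cc >= 89` check was written.
fn fp8_prefill_supported(cc: u32) -> bool {
    cc >= 89 && cc < 100
}
```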