Sovereign Tooling Map: World-Class or Wire It In
Every leaderboard-winning technique maps to a sovereign stack component. When a component doesn't support a technique at world-class level, we don't skip it — we find or build the capability and wire it into apr CLI commands.
5.1 Tooling Coverage Matrix
| Technique | Required Capability | Sovereign Component | Status | Gap Action |
|---|---|---|---|---|
| Import HF models | SafeTensors/GGUF → .apr | aprender 0.4.11 | ✅ Complete | apr import — 14+ architectures supported |
| Inference (decode) | Transformer forward pass | realizar 0.8 | ✅ Complete | apr run — 8-21% faster than llama.cpp |
| Inference (serve) | HTTP API, batching, streaming | realizar 0.8 | ✅ Complete | apr serve — OpenAI-compatible, PagedAttention |
| LoRA/QLoRA training | Low-rank adaptation, autograd | entrenar 0.7 | ✅ Complete | apr finetune — AdamW, cosine LR, checkpointing |
| Checkpoint management | Atomic save, resume, NaN scan, filtered load | aprender 0.4.11 | ✅ Complete | AprWriter::write() atomic (F-CKPT-009), AprReader::open_filtered() (F-CKPT-016), read_tensor_f32_checked() (F-CKPT-013), validate_tensor_shape() (F-CKPT-014) — 18/18 contracts |
| Knowledge distillation | KL-divergence, progressive, text-based | entrenar 0.7 | ✅ Complete | apr distill — standard, progressive, ensemble, text-based (GH-455) |
| Model merging | SLERP, TIES, DARE | aprender 0.4.11 | ✅ Complete | apr merge — 5 strategies |
| Pruning | Wanda, SparseGPT, structured | aprender 0.4.11 | ✅ Complete | apr prune — 6 methods |
| Quantization | INT4, INT8, Q4K, Q6K | aprender 0.4.11 | ✅ Complete | apr quantize — 4 formats |
| SIMD tensor ops | AVX2, AVX-512, NEON matmul | trueno 0.16.3 | ✅ Complete | 6% faster than NumPy at 256×256 |
| GPU compute | wgpu (Vulkan/Metal/DX12), CUDA PTX JIT | trueno 0.16.3 + trueno-gpu 0.4.35 | ✅ Complete | Pure Rust, any GPU vendor. wgpu cosine=0.999863 on Blackwell. See §25. |
| Speculative decoding | Draft model + verification | realizar 0.8 | ⚠️ Planned | GH-10: apr run --speculative not yet implemented |
| KV cache management | PagedAttention, CoW | realizar 0.8 | ✅ Complete | vLLM-style paged KV |
| Data loading | Parquet, JSONL, Arrow, HF Hub | alimentar 0.2 | ✅ Complete | Zero-copy Arrow RecordBatches |
| Data quality | Null/outlier/drift detection | alimentar 0.2 | ✅ Complete | 100-point quality scoring |
| Data decontamination | N-gram overlap detection | alimentar 0.2 | ✅ Wired | apr data decontaminate — n-gram overlap vs benchmarks (alimentar#30, aprender#415) |
| HPO | TPE, Hyperband, ASHA | entrenar 0.7 | ✅ Complete | apr tune --strategy tpe |
| Compile to binary | Model + runtime → executable | aprender 0.4.11 | ✅ Complete | apr compile |
| Correctness proofs | Kani bounded model checking | provable-contracts | ✅ Complete | 262 proof obligations |
| Quality gates | Compliance enforcement | pmat | ✅ Complete | 30+ automated checks |
| DPO/ORPO alignment | Preference optimization | entrenar 0.7 | ✅ Wired | make align → apr finetune --method dpo (GH-8: dedicated apr align planned) |
| Execution sandbox | Run generated code safely | — | ❌ Missing | External harness (see §5.3) |
| N-sampling + rerank | Batched generation, voting | aprender 0.27 | ⚠️ Partial | N-sampling via NUM_SAMPLES in eval script; --temperature + --top-k wired through batch mode. Reranking not yet implemented. |
| Prompt templates | SCoT, few-shot strategies | eval script | ✅ Working | 5 strategies in build_instruction(): standard, scot, few-shot, cgo, default. Few-shot best for HumanEval (+1.83pp). MBPP test assertions = +25.4pp. |
| Synthetic data gen | Teacher → training corpus | alimentar 0.2 + aprender | ⚠️ Partial | Generation via apr chat --batch; curation pipeline needed |
| Continued pretraining | Full-weight code corpus training | entrenar 0.7 | ⚠️ Partial | Full finetune works; needs large-corpus streaming |
| Flash Attention | Online softmax, tiled attention | trueno 0.16 | 🔧 In Progress | Phase 12 planned; tiling infra ready (wgpu compute shaders) |
5.2 Gap 1: DPO/ORPO Preference Optimization (CRITICAL)
Why world-class: DPO is the single most impactful post-training technique for leaderboards. Merged + DPO models "completely dominate" HF leaderboard rankings. Without DPO, we compete with one hand tied behind our back.
Current state: make align routes through apr finetune --method dpo, which connects to entrenar's loss functions. A dedicated apr align subcommand is planned (GH-8).
Current implementation:
# DPO alignment via make align (routes through apr finetune)
make align CHECKPOINT=model.apr PREFS_DATA=prefs.jsonl ALIGN_METHOD=dpo
# Equivalent direct command
apr finetune model.apr --method dpo --data prefs.jsonl \
--output aligned.apr --verbose
Remaining wire-in plan:
Component: entrenar
Add: src/dpo/mod.rs — DPO loss (β-scaled log-ratio of policy vs reference)
Add: src/dpo/data.rs — preference pair loader (chosen/rejected format)
Add: src/dpo/orpo.rs — ORPO variant (no reference model needed)
Component: alimentar
Add: Preference pair generation from execution feedback
alimentar generate-preferences \
--model model.apr \
--problems humaneval.jsonl \
--n-samples 10 \
--judge execution \
-o preference-pairs.jsonl
Component: Ground truth corpus
Use: hf-ground-truth-corpus, algorithm-competition-corpus
→ Source of verified correct/incorrect code pairs for DPO training
Acceptance criterion: apr align --method dpo produces a model with ≥2% higher HumanEval+ than the input model after 3 epochs.
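The "β-scaled log-ratio of policy vs reference" in the plan above can be sketched for a single preference pair as follows. This is a minimal illustration, not entrenar's API: the function names, the `f64` inputs, and the choice to pass summed token log-probabilities are all assumptions made for the example.

```rust
/// Numerically stable log(sigmoid(x)).
fn log_sigmoid(x: f64) -> f64 {
    if x >= 0.0 {
        -(-x).exp().ln_1p()
    } else {
        x - x.exp().ln_1p()
    }
}

/// DPO loss for one (chosen, rejected) pair.
/// Each argument is the summed token log-probability of a completion under
/// either the trainable policy or the frozen reference model (assumed inputs).
fn dpo_loss(
    policy_chosen: f64,
    policy_rejected: f64,
    ref_chosen: f64,
    ref_rejected: f64,
    beta: f64,
) -> f64 {
    // Log-ratio of policy vs reference for each completion.
    let chosen_ratio = policy_chosen - ref_chosen;
    let rejected_ratio = policy_rejected - ref_rejected;
    // The loss shrinks as the policy favors the chosen completion
    // more strongly than the reference does.
    -log_sigmoid(beta * (chosen_ratio - rejected_ratio))
}
```

When the policy equals the reference the margin is zero and the loss is ln 2; it falls below ln 2 once the policy prefers the chosen completion more than the reference does. ORPO, listed above as the reference-free variant, would need a different loss and is not shown here.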
5.3 Gap 2: Code Execution Sandbox (CRITICAL)
Why world-class: HumanEval and MBPP require executing generated code against test cases. Without execution, we can't compute pass@k — we can only measure perplexity, which doesn't correlate well with code correctness.
Current state: aprender has no sandboxed code execution. Generated completions must be evaluated externally.
Wire-in plan (two options):
Option A: External EvalPlus harness (short-term, pragmatic)
apr eval model.apr --data humaneval.jsonl --n-samples 10 \
--output-completions completions/ --json
# Then externally: evalplus.evaluate --dataset humaneval --samples completions/
# External harnesses are the common practice for scoring HumanEval/MBPP
Option B: WASM sandbox (long-term, sovereign)
Component: realizar or new crate
Add: Embedded WASM runtime (wasmtime) for safe code execution
apr eval model.apr --data humaneval.jsonl \
--sandbox wasm --timeout 10s --json
Advantage: Fully sovereign, no Python dependency even for eval
Risk: Python test cases require Python-in-WASM (CPython compiled to WASM)
Decision: Option A for v1.0 (get on the leaderboard), Option B as stretch goal. Neither compromises the "zero Python" claim for the model pipeline — eval is a separate concern.
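Whichever option executes the code, the pass@k number itself is plain arithmetic over per-problem pass counts. The sketch below uses the standard unbiased estimator, 1 - C(n-c, k)/C(n, k), where n completions were sampled and c passed. The function names and the `(n, c)` tuple layout are assumptions for illustration, not part of apr eval.

```rust
/// Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k),
/// where `n` completions were sampled and `c` of them passed the tests.
fn pass_at_k(n: u32, c: u32, k: u32) -> f64 {
    assert!(c <= n && k >= 1 && k <= n, "need c <= n and 1 <= k <= n");
    if n - c < k {
        // Every size-k subset contains at least one passing completion.
        return 1.0;
    }
    // Product form of C(n - c, k) / C(n, k); avoids large binomials.
    let mut all_fail = 1.0_f64;
    for i in (n - c + 1)..=n {
        all_fail *= 1.0 - f64::from(k) / f64::from(i);
    }
    1.0 - all_fail
}

/// Benchmark score: mean pass@k over (n, c) results, one tuple per problem.
/// Returns NaN for an empty slice.
fn mean_pass_at_k(results: &[(u32, u32)], k: u32) -> f64 {
    let total: f64 = results.iter().map(|&(n, c)| pass_at_k(n, c, k)).sum();
    total / results.len() as f64
}
```

For k = 1 the estimator reduces to c/n, so 3 passing completions out of 10 give pass@1 = 0.3.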
5.4 Gap 3: N-Sampling + Reranking Pipeline
Why world-class: Generating N=10-50 completions and selecting the best one boosts effective pass@1 by 10-30%. This is the single most impactful inference-time technique.
Current state: aprender can generate multiple completions via temperature sampling. Missing: batched generation, reranking logic, majority voting.
Wire-in plan:
Component: aprender (apr-cli)
Extend: `apr eval --n-samples N --rerank strategy`
Strategies: logprob (sum of log-probabilities), majority (output voting),
execution (run and pick passing code — requires sandbox)
Component: realizar
Already supports: batched generation, concurrent requests
Need: expose batch generation for N completions per prompt efficiently
Component: alimentar
Add: Result aggregation and voting logic for N-sample outputs
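Two of the strategies named above, logprob and majority, need no sandbox and can be sketched directly. The `Sample` struct, the function names, and voting on whitespace-trimmed completion text are assumptions for this example; a real pipeline might vote on execution outputs instead.

```rust
use std::collections::HashMap;

/// One sampled completion and the sum of its token log-probabilities.
struct Sample {
    text: String,
    sum_logprob: f64,
}

/// `logprob` strategy: index of the completion with the highest summed log-probability.
fn rerank_logprob(samples: &[Sample]) -> Option<usize> {
    samples
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.sum_logprob.total_cmp(&b.1.sum_logprob))
        .map(|(i, _)| i)
}

/// `majority` strategy: index of the first completion whose trimmed text
/// received the most votes across all samples.
fn rerank_majority(samples: &[Sample]) -> Option<usize> {
    let mut votes: HashMap<&str, usize> = HashMap::new();
    for s in samples {
        *votes.entry(s.text.trim()).or_insert(0) += 1;
    }
    // Scan in input order so ties resolve to the earliest sample.
    let mut best: Option<(usize, usize)> = None; // (index, vote count)
    for (i, s) in samples.iter().enumerate() {
        let count = votes[s.text.trim()];
        if best.map_or(true, |(_, top)| count > top) {
            best = Some((i, count));
        }
    }
    best.map(|(i, _)| i)
}
```

The two strategies can disagree: the single most likely completion is not necessarily the answer most samples converge on, which is why the plan keeps the strategy selectable.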
5.5 Gap 4: Synthetic Training Data Pipeline
Why world-class: Qwen2.5-Coder, Phi-4, and NVIDIA OCR-Nemotron all credit large-scale synthetic data as core to their success. Without high-quality synthetic training data, fine-tuning is limited to existing datasets.
Current state: apr chat --batch can generate completions. alimentar handles data loading and quality scoring. Ground-truth corpora exist (hf-ground-truth-corpus, algorithm-competition-corpus). Missing: end-to-end curation pipeline.
Wire-in plan:
Component: alimentar
CLI pipeline:
# 1. Generate raw synthetic code from teacher
apr chat teacher.apr --batch problems.txt --n-samples 5 \
--temperature 0.8 --json > raw-synthetic.jsonl
# 2. Quality-filter with alimentar
alimentar quality raw-synthetic.jsonl --min-score 80 \
-o filtered-synthetic.jsonl
# 3. Decontaminate against eval benchmarks
alimentar drift filtered-synthetic.jsonl \
--reference humaneval.jsonl mbpp.jsonl \
--overlap-threshold 0.01 \
-o clean-synthetic.jsonl
# 4. Balance and split
alimentar convert clean-synthetic.jsonl \
-o training-data.parquet
Component: Ground truth corpora
hf-ground-truth-corpus → HuggingFace API patterns, transformer implementations
algorithm-competition-corpus → Algorithm problems with verified solutions
→ Both feed into fine-tuning data mix
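The decontamination step (step 3) rests on n-gram overlap between each synthetic sample and the benchmark problems. A minimal sketch of that check follows, assuming whitespace tokenization and an overlap ratio measured against the candidate's own n-grams. None of these names come from alimentar, and its actual tokenization and scoring may differ.

```rust
use std::collections::HashSet;

/// Set of whitespace-token n-grams in `text` (empty if the text is too short).
fn ngrams(text: &str, n: usize) -> HashSet<Vec<&str>> {
    let tokens: Vec<&str> = text.split_whitespace().collect();
    if n == 0 || tokens.len() < n {
        return HashSet::new();
    }
    tokens.windows(n).map(|w| w.to_vec()).collect()
}

/// Fraction of the candidate's distinct n-grams that also occur in the benchmark text.
fn overlap_ratio(candidate: &str, benchmark: &str, n: usize) -> f64 {
    let cand = ngrams(candidate, n);
    if cand.is_empty() {
        return 0.0;
    }
    let bench = ngrams(benchmark, n);
    let shared = cand.intersection(&bench).count();
    shared as f64 / cand.len() as f64
}

/// Keep samples whose overlap with every benchmark entry is at or below `threshold`.
fn decontaminate<'a>(
    samples: &[&'a str],
    benchmarks: &[&str],
    n: usize,
    threshold: f64,
) -> Vec<&'a str> {
    samples
        .iter()
        .copied()
        .filter(|s| benchmarks.iter().all(|b| overlap_ratio(s, b, n) <= threshold))
        .collect()
}
```

A verbatim copy of a benchmark problem scores 1.0 and is dropped at any threshold below 1, while a sample that shares only boilerplate such as `) : return` scores low and survives a moderate threshold.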
5.6 Gap 5: Prompt Strategy Engine
Why world-class: SCoT prompting improves HumanEval pass@1 by up to 13.79%. Few-shot exemplars add 3-8%. The prompt template matters as much as the model weights.
Current state: PROMPT_STRATEGY is implemented in scripts/eval-pass-at-k.sh; the built-in strategies and their aliases are listed in the table below. The upstream apr run --chat provides raw chat template support.
Implemented in eval pipeline:
# Strategies are selected per run via PROMPT_STRATEGY (best: few-shot, 87.20%):
make eval-humaneval CHECKPOINT=m.apr PROMPT_STRATEGY=standard
make eval-humaneval CHECKPOINT=m.apr PROMPT_STRATEGY=scot
make eval-humaneval CHECKPOINT=m.apr PROMPT_STRATEGY=few-shot
make eval-humaneval CHECKPOINT=m.apr PROMPT_STRATEGY=cgo
Built-in strategies (with aliases):
| Strategy | Aliases | Description |
|---|---|---|
| standard | default | Raw problem → code (baseline) |
| scot | structured-cot | Structured chain-of-thought → code (+5-14%) |
| few-shot | fewshot | N exemplars + problem → code (+3-8%) |
| cgo | code-gen-opt | Chain of grounded objectives → code (+5-10%) |
| reflexion | reflect | Generate → test → reflect → regenerate (multi-turn) |
Remaining wire-in for upstream apr:
Component: realizar
Already supports: chat templates (ChatML, LLaMA2, Mistral, Phi, Alpaca)
Need: expose template composition for eval pipeline
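If template composition moves from the eval script into Rust, alias resolution and prompt assembly could look roughly like the sketch below. The strategy names and aliases follow the table above. The `Strategy` enum, the function signatures, and every prompt wording are hypothetical and do not reproduce the text used by build_instruction() in the eval script.

```rust
/// Prompt strategies from the alias table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Strategy {
    Standard,
    Scot,
    FewShot,
    Cgo,
    Reflexion,
}

/// Resolve a PROMPT_STRATEGY value or one of its aliases.
fn parse_strategy(name: &str) -> Option<Strategy> {
    match name {
        "standard" | "default" => Some(Strategy::Standard),
        "scot" | "structured-cot" => Some(Strategy::Scot),
        "few-shot" | "fewshot" => Some(Strategy::FewShot),
        "cgo" | "code-gen-opt" => Some(Strategy::Cgo),
        "reflexion" | "reflect" => Some(Strategy::Reflexion),
        _ => None,
    }
}

/// Build the instruction for one problem. All template wording is illustrative.
fn build_instruction(strategy: Strategy, problem: &str, exemplars: &[&str]) -> String {
    match strategy {
        // Reflexion is multi-turn; only its first turn is built here.
        Strategy::Standard | Strategy::Reflexion => {
            format!("Complete the following function.\n\n{problem}")
        }
        Strategy::Scot => format!(
            "Outline the solution with sequence, branch, and loop structures, then write the code.\n\n{problem}"
        ),
        Strategy::FewShot => {
            let shots = exemplars.join("\n\n");
            format!("{shots}\n\n{problem}")
        }
        Strategy::Cgo => format!(
            "List the objectives the code must satisfy, then write code that meets each one.\n\n{problem}"
        ),
    }
}
```

Returning `None` for unknown names lets the caller decide whether to fail fast or fall back to the baseline strategy.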
5.7 Sovereign Stack Version Requirements
All gap closures must use published crates from crates.io. No git dependencies.
| Crate | Current | Required For Gaps | Minimum Version |
|---|---|---|---|
| aprender | 0.27.2 | apr align, --n-samples --rerank, checkpoint contracts (18/18 done in 0.27.2) | 0.28 |
| entrenar | 0.7.5 | DPO loss, preference pair loader, ORPO | 0.8 |
| trueno | 0.16.1 | Flash attention (Phase 12) | 0.17 |
| realizar | 0.8.0 | Batch N-sampling, prompt template composition | 0.9 |
| alimentar | 0.2.6 | Decontamination pipeline, preference pair generation, quality filtering | 0.3 |
| provable-contracts | 0.1 | DPO kernel contracts | 0.2 |
5.8 The Decision Rule
When we find a gap:
- Can an existing sovereign crate do it? → Wire it in via apr CLI. No new crates.
- Does a sovereign crate need a new module? → Add it to that crate, publish to crates.io, bump apr-leaderboard's dependency.
- Is it fundamentally outside the stack's scope? → Use an external tool (e.g., EvalPlus for code execution) and document the boundary explicitly.
- Is it a research problem with no clear solution? → Add to §21 Open Questions. Don't block the pipeline.
Hard rule: We never add a Python dependency. We never add a C/C++ FFI dependency. GPU compute is wgpu (primary, any vendor, pure Rust) with optional CUDA backend for hardware where wgpu support lags (e.g., Blackwell sm_121). No GPU vendor lock-in. If the sovereign stack can't do it in pure Rust, we either build it or scope it out with an explicit boundary.
5.9 Parity Check: Ludwig Feature Coverage
Ludwig (ludwig.ai) is a leading declarative ML framework. Every feature Ludwig ships, the sovereign stack must match or exceed in pure Rust, with zero Python. This is the parity bar.
5.9.1 Feature-by-Feature Parity Matrix
Training & Fine-tuning:
| Ludwig Feature | Ludwig Implementation | Sovereign Stack | Status |
|---|---|---|---|
| Full fine-tuning | PyTorch, trainable=true | entrenar apr finetune --method full | ✅ Parity |
| LoRA adapters | PEFT library, configurable rank/dropout/targets | entrenar apr finetune --method lora | ✅ Parity |
| QLoRA (4-bit base + LoRA) | bitsandbytes + PEFT | entrenar apr finetune --method qlora | ✅ Parity |
| AdaLoRA (dynamic rank allocation) | PEFT AdaLoRA | entrenar — not yet | ❌ Gap |
| IA3 (inhibiting/amplifying activations) | PEFT IA3 | entrenar — not yet | ❌ Gap |
| DoRA (weight-decomposed LoRA) | PEFT DoRA variant | entrenar — not yet | ❌ Gap |
| NEFTune (embedding noise) | noise injection during fine-tune | entrenar — not yet | ❌ Gap |
| Gradient accumulation | PyTorch native | entrenar gradient accumulation | ✅ Parity |
| Mixed precision (fp16/bf16) | PyTorch AMP | entrenar GradScaler, bf16/fp16 | ✅ Parity |
| Early stopping | callback-based | entrenar EarlyStopping callback | ✅ Parity |
| Checkpointing | periodic save, atomic write, resume | aprender AprWriter::write() (atomic) + entrenar CheckpointCallback | ✅ Exceeds (18 contracts: atomic writes, NaN scan, filtered load, round-trip determinism, provenance) |
| Learning rate warmup + cosine decay | scheduler | entrenar WarmupCosineDecayLR | ✅ Parity |
Optimizers:
| Ludwig Feature | Ludwig Implementation | Sovereign Stack | Status |
|---|---|---|---|
| AdamW | PyTorch AdamW | entrenar AdamW (SIMD-accelerated) | ✅ Exceeds |
| Adam | PyTorch Adam | entrenar Adam | ✅ Parity |
| SGD with momentum | PyTorch SGD | entrenar SGD with momentum | ✅ Parity |
| 8-bit optimizers | bitsandbytes 8-bit Adam | — not yet | ❌ Gap |
| Paged optimizers | bitsandbytes paged | — not yet | ❌ Gap |
Distributed Training:
| Ludwig Feature | Ludwig Implementation | Sovereign Stack | Status |
|---|---|---|---|
| Multi-GPU DDP | PyTorch DDP via Ray | — not yet (single-GPU via wgpu) | ❌ Gap |
| DeepSpeed ZeRO | Microsoft DeepSpeed | — not yet | ❌ Gap |
| Multi-node training | Ray cluster | entrenar GPU-SHARE Phase 3 (SSH cluster, job placement) | ✅ Exceeds (heterogeneous: 4090 + Jetson + CPU nodes) |
| Automatic batch size selection | binary search on GPU OOM | aprender --vram planning + entrenar VRAM guard | ✅ Parity |
| GPU sharing (multi-adapter) | not supported | entrenar GPU-SHARE (multi-adapter single-process, 3x VRAM savings) | ✅ Exceeds |
Quantization:
| Ludwig Feature | Ludwig Implementation | Sovereign Stack | Status |
|---|---|---|---|
| 4-bit quantization (nf4/fp4) | bitsandbytes | aprender INT4, Q4K | ✅ Parity |
| 8-bit quantization | bitsandbytes | aprender INT8, Q8_0 | ✅ Parity |
| Double quantization | bitsandbytes nested | — not yet | ⚠️ Partial |
| GPTQ | auto-gptq | — not yet | ❌ Gap |
| AWQ | autoawq | — not yet | ❌ Gap |
Inference & Generation:
| Ludwig Feature | Ludwig Implementation | Sovereign Stack | Status |
|---|---|---|---|
| Greedy decoding | HF generate | realizar greedy | ✅ Parity |
| Temperature sampling | HF generate | realizar temperature | ✅ Parity |
| Top-k sampling | HF generate | realizar top-k | ✅ Parity |
| Nucleus (top-p) sampling | HF generate | realizar top-p | ✅ Parity |
| Beam search | HF generate | aprender num_beams | ✅ Parity |
| Contrastive search | HF generate | — not yet | ❌ Gap |
| Diverse beam search | HF generate | — not yet | ❌ Gap |
| Repetition penalty | HF generate | aprender repetition_penalty | ✅ Parity |
| Speculative decoding | not supported | realizar speculative (apr run --speculative planned, GH-10; see §5.1) | ⚠️ Planned |
| Streaming generation | not documented | realizar SSE streaming | ✅ Exceeds |
| OpenAI-compatible API | not supported | realizar /v1/chat/completions | ✅ Exceeds |
| PagedAttention KV cache | not supported | realizar paged KV | ✅ Exceeds |
| Continuous batching | not supported | realizar batch scheduling | ✅ Exceeds |
Serving & Deployment:
| Ludwig Feature | Ludwig Implementation | Sovereign Stack | Status |
|---|---|---|---|
| REST API serving | ludwig serve (Flask) | realizar apr serve (Axum) | ✅ Parity |
| Docker containers | prebuilt images | — user-provided | ⚠️ Partial |
| TorchScript export | PyTorch jit.trace | — not applicable (native binary) | N/A |
| Triton Inference Server | export format | — not applicable | N/A |
| HuggingFace Hub upload | ludwig upload | aprender apr publish | ✅ Parity |
| Compile to standalone binary | not supported | aprender apr compile | ✅ Exceeds |
| ONNX/CoreML/OpenVINO export | not supported | aprender apr export | ✅ Exceeds |
Data Processing:
| Ludwig Feature | Ludwig Implementation | Sovereign Stack | Status |
|---|---|---|---|
| CSV/JSON/Parquet/HDF5 loading | pandas | alimentar Arrow-native | ✅ Exceeds (zero-copy) |
| Auto preprocessing per feature type | Ludwig preprocessors | alimentar transforms | ✅ Parity |
| Train/val/test splitting | Ludwig split | alimentar DatasetSplit (stratified) | ✅ Parity |
| Larger-than-memory datasets | Ray datasets | alimentar MmapDataset, streaming | ✅ Parity |
| Data quality scoring | not built-in | alimentar 100-point quality scoring | ✅ Exceeds |
| Drift detection | not built-in | alimentar KS/Chi-sq/PSI/JSD | ✅ Exceeds |
| Imbalance detection + resampling | not built-in | alimentar SMOTE, oversample | ✅ Exceeds |
Hyperparameter Optimization:
| Ludwig Feature | Ludwig Implementation | Sovereign Stack | Status |
|---|---|---|---|
| Random search | Ray Tune | entrenar RandomSearch | ✅ Parity |
| Grid search | Ray Tune | entrenar GridSearch | ✅ Parity |
| Bayesian (TPE) | Ray Tune Optuna | entrenar TPEOptimizer | ✅ Parity |
| ASHA scheduler | Ray Tune ASHA | entrenar HyperbandScheduler | ✅ Parity |
| Distributed HPO | Ray cluster | — not yet (local only) | ❌ Gap |
Model Architecture:
| Ludwig Feature | Ludwig Implementation | Sovereign Stack | Status |
|---|---|---|---|
| ECD (Encoder-Combiner-Decoder) | Ludwig native | — different architecture | N/A (not needed) |
| GBM (LightGBM) | LightGBM wrapper | — not in scope | N/A |
| LLM causal models | HF Transformers | aprender + realizar | ✅ Parity |
| Multi-modal (text+image+audio) | ECD combiner | — LLM-only for leaderboard | N/A (future) |
| Multi-task learning | multiple output heads | — not yet | ⚠️ Partial |
| Custom PyTorch modules | register API | — Rust modules via entrenar | ✅ Parity |
Experiment Tracking:
| Ludwig Feature | Ludwig Implementation | Sovereign Stack | Status |
|---|---|---|---|
| TensorBoard | callback | — not yet | ❌ Gap |
| Weights & Biases | callback | — not yet | ❌ Gap |
| MLflow | callback | — not yet | ❌ Gap |
| Comet ML | callback | — not yet | ❌ Gap |
| Built-in TUI monitoring | not supported | entrenar monitor + TUI | ✅ Exceeds |
| Prometheus metrics | not supported | realizar /metrics | ✅ Exceeds |
Explainability & Visualization:
| Ludwig Feature | Ludwig Implementation | Sovereign Stack | Status |
|---|---|---|---|
| Feature importance | built-in | entrenar ExplainabilityCallback | ✅ Parity |
| Learning curves | matplotlib | entrenar MonitorCallback | ⚠️ Partial |
| Confusion matrices | built-in | entrenar eval metrics | ⚠️ Partial |
| Model architecture visualization | built-in | aprender apr tree, apr flow | ✅ Parity |
Correctness & Quality (sovereign stack advantages):
| Feature | Ludwig | Sovereign Stack | Advantage |
|---|---|---|---|
| Provable kernel correctness | none | provable-contracts Kani L4 | ✅ Unique |
| 262 proof obligations | none | provable-contracts | ✅ Unique |
| Compliance enforcement | none | pmat comply 30+ checks | ✅ Unique |
| Deterministic builds | pip/conda chaos | Cargo.lock | ✅ Unique |
| wgpu GPU compute (any vendor) | requires CUDA toolkit | trueno wgpu (Vulkan/Metal/DX12) | ✅ Unique |
| Format-agnostic conversion | not supported | aprender apr rosetta | ✅ Unique |
| Model diff/forensics | not supported | aprender apr diff, apr hex | ✅ Unique |
| 10-stage integrity check | not supported | aprender apr check | ✅ Unique |
5.9.2 Summary: Where We Exceed, Where We Must Close Gaps
We have parity in 24+ areas: LoRA, QLoRA, full fine-tuning, AdamW/Adam/SGD, gradient accumulation, mixed precision, early stopping, LR scheduling, all sampling strategies, beam search, REST serving, HF upload, data loading, preprocessing, train/val/test splits, HPO (grid/random/TPE/ASHA), feature importance.
We exceed Ludwig in 16+ areas: speculative decoding, PagedAttention, continuous batching, streaming API, OpenAI-compatible serving, compile-to-binary, multi-format export (ONNX/CoreML/OpenVINO), data quality scoring, drift detection, imbalance detection, Prometheus metrics, TUI monitoring, provable contracts, deterministic builds, format forensics, checkpointing (18 verified contracts: atomic writes, NaN scan, filtered loading, round-trip determinism, provenance chain, versus Ludwig's basic callback).
Gaps to close (9 items):
| Gap | Priority | Wire-in Target |
|---|---|---|
| AdaLoRA (dynamic rank) | Medium | entrenar 0.8 |
| IA3 adapter | Low | entrenar 0.8 |
| DoRA (weight-decomposed LoRA) | Medium | entrenar 0.8 |
| NEFTune (embedding noise) | Low | entrenar 0.8 |
| 8-bit optimizers | Low | entrenar 0.8 |
| Contrastive search decoding | Low | aprender 0.28 |
| Diverse beam search | Low | aprender 0.28 |
| Multi-GPU DDP | High | entrenar 0.9 |
| GPTQ quantization | Medium | aprender 0.28 |
Recently closed gaps:
- Multi-node training → GPU-SHARE Phase 3: SSH cluster config, job placement, checkpoint coordination (143 tests)
- Automatic batch size selection → VRAM guard + ledger prevents OOM, --vram planning
- Experiment tracking → entrenar TUI monitor + JSONL event logging + checkpoint metadata
Out of scope (not needed for leaderboard): ECD architecture, GBM/LightGBM, multi-modal (text+image+audio), Triton export, TorchScript. These serve Ludwig's "general ML framework" positioning. We are a purpose-built leaderboard pipeline, not a general framework.
5.10 GPU Compute Architecture: PTX JIT vs Pre-compiled Kernels
5.10.1 Why PTX JIT (Not nvcc)
PyTorch ships fat binaries: pre-compiled SASS (GPU machine code) for every supported architecture (sm_70, sm_80, sm_86, sm_89, sm_90). At runtime, the CUDA driver selects the matching SASS, so there is no JIT step and startup is instant. This requires nvcc (NVIDIA's proprietary compiler) and the CUDA toolkit (2+ GB) at build time.
trueno-gpu takes a fundamentally different approach: PTX string templates embedded in Rust. PTX (Parallel Thread Execution) is NVIDIA's stable intermediate assembly language. trueno-gpu writes CUDA kernels directly as PTX strings in Rust source code, compiled into the apr binary by cargo build — no nvcc, no CUDA toolkit, no C/C++ FFI.
At runtime, the CUDA driver JIT-compiles PTX to device-specific SASS for whatever GPU is present. This is the same mechanism PyTorch uses as a fallback for unsupported architectures — trueno-gpu uses it as the primary path.
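To make "PTX string templates embedded in Rust" concrete, here is a minimal sketch of the idea with a no-op kernel. The constant name, the `{target}` placeholder, and `render_ptx` are hypothetical and are not taken from trueno-gpu. The `.version` line is also an assumption and would have to be raised for newer targets.

```rust
/// A no-op kernel written directly as PTX text and compiled into the binary
/// as an ordinary string constant. `{target}` is substituted at runtime.
const NOOP_PTX_TEMPLATE: &str = "\
.version 8.0
.target {target}
.address_size 64

.visible .entry noop()
{
    ret;
}
";

/// Render the template for a compute capability, e.g. (8, 9) becomes "sm_89".
/// The resulting text is what would be handed to the CUDA driver for JIT compilation.
fn render_ptx(cc_major: u32, cc_minor: u32) -> String {
    let target = format!("sm_{cc_major}{cc_minor}");
    NOOP_PTX_TEMPLATE.replace("{target}", &target)
}
```

Because the kernel is just a `&str`, building it needs only `cargo build`; the toolchain cost moves to the driver's JIT step at load time.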
5.10.2 Trade-offs
| Aspect | PyTorch (pre-compiled SASS) | trueno-gpu (PTX JIT) |
|---|---|---|
| Build deps | nvcc + CUDA toolkit (2+ GB) | cargo build only |
| New GPU support | Requires new release with SASS | Automatic (PTX forward-compatible) |
| Startup time | Instant | 20-80s JIT (amortized by --batch-jsonl) |
| Binary size | ~500 MB (fat binaries) | ~10 MB (PTX strings) |
| Vendor lock-in | CUDA toolkit version | No toolkit pin (PTX is a stable virtual ISA); the CUDA path itself remains NVIDIA-only |
| Reproducibility | Tied to CUDA/cuDNN version | Same binary, any NVIDIA GPU |
5.10.3 Amortization via Batch Mode
The --batch-jsonl flag is the architectural answer to JIT overhead. For a 164-problem HumanEval eval:
- Without batch: 80s JIT × 164 invocations = 3.6 hours of JIT alone
- With batch: 80s JIT × 1 load = 80s total JIT, then pure inference
Amortized JIT cost per problem: <0.5s. The sovereignty benefit (zero external toolchain, forward GPU compatibility) far outweighs the one-time startup cost.
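The figures above follow from the worst-case 80 s JIT time and the 164 HumanEval problems:

```latex
\begin{align*}
T_{\text{per-invocation}} &= 164 \times 80\,\text{s} = 13{,}120\,\text{s} \approx 3.64\,\text{h} \\
T_{\text{batch}} &= 1 \times 80\,\text{s} = 80\,\text{s} \\
\frac{T_{\text{batch}}}{164} &= \frac{80\,\text{s}}{164} \approx 0.49\,\text{s per problem}
\end{align*}
```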
5.10.4 Blackwell sm_121 and the Try 1/Try 2 Pattern
On Blackwell (sm_121), the CUDA 13.0 driver has a JIT bug: it rejects PTX with .target sm_121 (error 300, CUDA_ERROR_INVALID_SOURCE). The GH-480 fix implements a defensive fallback:
- Try 1: Compile PTX with explicit .target sm_121 → fails (error 300)
- Try 2: Compile with cuModuleLoadData (no explicit target) → succeeds
This Try 1 → Try 2 pattern is a driver workaround, not a design choice. When NVIDIA fixes the sm_121 JIT in a future driver, Try 1 will succeed and the fallback becomes dead code. The PTX post-processor (GH-480) also patches backward bra LABEL instructions to @%p_jw bra LABEL for sm_121 compatibility.
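The control flow of the fallback can be sketched independently of the CUDA bindings. The example below models each load attempt as a closure returning a driver error code on failure. The type and function names are hypothetical, and falling back only on error 300 (rather than on any failure) is an assumption of this sketch, not a documented detail of the GH-480 fix.

```rust
/// Driver error code for CUDA_ERROR_INVALID_SOURCE.
const CUDA_ERROR_INVALID_SOURCE: u32 = 300;

/// Which attempt produced the loaded module.
#[derive(Debug, PartialEq, Eq)]
enum LoadPath {
    ExplicitTarget, // Try 1
    DriverDefault,  // Try 2
}

/// Run Try 1; if it fails with INVALID_SOURCE, run Try 2.
/// Any other Try 1 error is returned unchanged so real failures stay visible.
fn load_with_fallback<M>(
    try_explicit: impl FnOnce() -> Result<M, u32>,
    try_default: impl FnOnce() -> Result<M, u32>,
) -> Result<(M, LoadPath), u32> {
    match try_explicit() {
        Ok(module) => Ok((module, LoadPath::ExplicitTarget)),
        Err(CUDA_ERROR_INVALID_SOURCE) => {
            try_default().map(|module| (module, LoadPath::DriverDefault))
        }
        Err(other) => Err(other),
    }
}
```

Returning the `LoadPath` alongside the module makes it easy to log which attempt succeeded, so the fallback can be retired once Try 1 starts passing on a fixed driver.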
5.10.5 FP8 Architecture Guard (GH-542)
FP8 E4M3 GEMM kernels (Ada/Hopper-specific) cause CUDA_ERROR_ILLEGAL_ADDRESS on Blackwell, poisoning the CUDA context. Fix: detect_fp8_prefill() uses cc >= 89 && cc < 100 to auto-disable FP8 on Blackwell. Provable contract: gpu-context-health-v1.yaml (3 proof obligations, 3 falsification tests).
Five-whys: (1) Why crash? FP8 warmup writes invalid memory on sm_121. (2) Why invalid? FP8 E4M3 cuBLASLt kernels are Ada/Hopper-specific. (3) Why enabled? cc >= 89 without upper bound. (4) Why no bound? Blackwell didn't exist when written. (5) Fix: cc < 100 guard in 3 files (commit a4bcd908).
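The guard itself is a bounded range check on the compute capability. A minimal sketch follows, assuming `cc` is encoded as major × 10 + minor (89 for sm_89, 121 for sm_121); the function name is hypothetical and is not the signature of detect_fp8_prefill().

```rust
/// FP8 prefill is allowed only for compute capabilities from 8.9 up to,
/// but not including, 10.0. `cc` is assumed to be major * 10 + minor.
fn fp8_prefill_allowed(cc: u32) -> bool {
    // The upper bound is the fix: without `cc < 100`, Blackwell (cc = 121)
    // would pass the `cc >= 89` test and enable the Ada/Hopper-only kernels.
    cc >= 89 && cc < 100
}
```

With this check, sm_89 and sm_90 keep FP8 enabled while sm_86 and sm_121 do not.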