Training Performance Specification
0. Design Principles
This specification follows design by contract (DbC). Every performance
claim, optimization target, and implementation phase begins with a provable
contract (pv validate) that defines equations, invariants, proof obligations,
and falsification tests. Code is written to satisfy the contract — never the
reverse.
Verification stack (sovereign, no external dependencies):
| Layer | Tool | Role |
|---|---|---|
| Contract | pv (provable-contracts) | YAML equations, proof obligations, falsification tests, Kani harnesses |
| Benchmark | Raw C + Criterion + regression | Three-tier: raw C cuBLAS (ceiling) vs Rust cuBLAS vs PTX (floor) |
| Profiling | probador (probar) | Brick budgets, per-component SLA enforcement, Jidoka gates |
| Tracing | renacer (BrickTracer) | Per-kernel/per-block/per-transfer spans, OTLP export, anomaly escalation |
| Measurement | renacer (metrics) | Counter/Gauge/Histogram with SIMD acceleration (trueno) |
Workflow for every optimization phase:
1. pv validate contracts/cublas-gemm-v1.yaml # Contract first
2. pv scaffold contracts/cublas-gemm-v1.yaml # Generate test stubs
3. make bench-gemm-raw # Establish ceiling
4. Implement against contract
5. make bench-gemm-compare # Three-tier benchmark
6. probador brick budgets verify per-component SLAs # Brick profiling
7. renacer --trace-compute traces per-kernel timing # Layer tracing
8. pv audit contracts/cublas-gemm-v1.yaml # Binding coverage
9. Dogfood on 350M training run
10. make bench-gemm-regression # No regressions
11. Close gap in §11
1. Current Performance Baseline
1.1 Measured Throughput
| Metric | Value | Config |
|---|---|---|
| Throughput (pre-optimization) | 934 tok/s | 350M, seq=1024, batch=4, RTX 4090 |
| Step time (pre-optimization) | ~4.4s | Same config |
| Throughput (current, Phase 5b) | 7,676 tok/s | Same config (steady state, step 1000) |
| Step time (current, Phase 5b) | 513 ms | Same config (steady state) |
| MFU (current, Phase 5b) | 22.2% | vs FP32 peak (as reported by trainer) |
| VRAM usage | ~11.6 GB / 24 GB | Same config |
| Training loss (v3, step 26K) | 6.61 | v3 run (PID 1975811, codeparrot-clean) |
| Validation loss (v3, step 26K) | 6.91 | val_ppl=1000.3 |
| Loss trajectory (v3) | 10.40 → 6.61 (step 26K) | v3 run (250K steps target) |
| Gradient norm (v3) | 3.04 → 0.13 (step 1K → 26K) | Monotonic decrease |
| Tokens processed (v3) | 108M | 26,400 × 4 × 1024 |
1.2 MFU Analysis
Model FLOPs Utilization (MFU) measures actual compute throughput against hardware theoretical peak. For a transformer forward+backward pass, the standard approximation is 6 x params x tokens_per_step FLOPs.
Model parameters: 370M (24 layers, hidden=1024, intermediate=4096)
Tokens per step: 4 x 1024 = 4,096 tokens
FLOPs per step: 6 x 370M x 4,096 = 9.1 TFLOP
Step time: 4.4s
Achieved FLOP/s: 9.1 TFLOP / 4.4s = 2.07 TFLOP/s
RTX 4090 FP16 peak: 165 TFLOP/s (with tensor cores)
RTX 4090 FP32 peak: 82.6 TFLOP/s (without tensor cores)
MFU (vs FP16 peak): 2.07 / 165 = 1.3%
MFU (vs FP32 peak): 2.07 / 82.6 = 2.5%
MFU = 2.5% (vs FP32 peak) / 1.3% (vs FP16 peak)
1.3 Research Benchmarks for Context
| System | Model Size | Hardware | MFU | Source |
|---|---|---|---|---|
| GPT-3 (OpenAI) | 175B | A100 cluster | 21% | Brown et al. 2020 |
| PaLM (Google) | 540B | TPU v4 | 46-57% | Chowdhery et al. 2022 |
| LLaMA (Meta) | 65B | A100 80GB | 36% | Touvron et al. 2023 |
| Chinchilla (DeepMind) | 70B | TPU v3/v4 | ~40% | Hoffmann et al. 2022 |
| Typical single-GPU PyTorch | 350M | RTX 4090 | 25-35% | Community benchmarks |
| Albor (current) | 370M | RTX 4090 | 2.5% | Measured |
The gap is 10-15x vs what the hardware can deliver for this model size.
1.4 Baseline Profiling Protocol (renacer + probador)
Before any optimization, establish ground truth with brick-level profiling:
# Layer-level tracing: per-kernel timing for one training step
renacer --otlp-endpoint http://localhost:4317 \
--otlp-service-name "albor-baseline" \
--trace-compute \
--trace-compute-threshold 100 \
-- apr train apply --task pretrain \
--config configs/train/pretrain-350m-cuda-test.yaml
# View in Jaeger: http://localhost:16686 -> Service: "albor-baseline"
# Each GEMM kernel, norm kernel, PCIe transfer is a span with duration_us
BrickTracer escalation thresholds for baseline measurement:
#![allow(unused)]
fn main() {
let thresholds = BrickEscalationThresholds::default()
.with_cv(15.0) // Escalate if kernel timing CV > 15%
.with_efficiency(25.0) // Escalate if compute efficiency < 25%
.with_rate_limit(100); // Max 100 traces/second during profiling
}
Brick budget breakdown (probador) — defines the per-component SLA that each optimization phase must improve:
#![allow(unused)]
fn main() {
let step_budget = BrickHouseBuilder::new("training-step")
.budget_ms(4400) // Current step time
.brick("gemm_forward", 1400) // 7 GEMMs x 24 blocks + LM head
.brick("gemm_backward", 1100) // 14 GEMMs x 24 blocks + LM head
.brick("cpu_optimizer", 800) // 24 blocks + LM head + embedding
.brick("cpu_embedding", 200) // Scatter-gather forward + backward
.brick("pcie_transfer", 150) // 3 transfers (H2D embed, D2H logits, H2D grad)
.brick("elementwise_kernel", 100) // RMSNorm, RoPE, SiLU
.brick("cross_entropy", 50) // Fused CE forward + backward
.brick("stream_sync", 50) // ALB-065 synchronization
.brick("overhead", 550) // Scheduling, allocator, host logic
.build()?;
}
Each brick has a Jidoka gate: if any component exceeds its budget by >2x after an optimization, training stops and alerts. This prevents silent regressions.
2. Root Cause Analysis
2.1 The GEMM Bottleneck
A 350M transformer forward+backward step executes 552 GEMM operations:
Per transformer block (24 blocks):
Forward:
- Q projection: GEMM [S, H] x [H, H] (1)
- K projection: GEMM [S, H] x [H, H_kv] (1)
- V projection: GEMM [S, H] x [H, H_kv] (1)
- Attention out: GEMM [S, H] x [H, H] (1)
- FFN gate: GEMM [S, H] x [H, I] (1)
- FFN up: GEMM [S, H] x [H, I] (1)
- FFN down: GEMM [S, I] x [I, H] (1)
Backward (roughly 2x forward):
- dQ, dK, dV, dAttn_out, dGate, dUp, dDown (7)
- Weight gradients for each of the above (7)
Subtotal per block: 7 + 14 = 21 GEMMs
LM head (vocab projection):
Forward: GEMM [S, H] x [H, V] (1)
Backward: GEMM for dInput + dWeight (2)
Subtotal: 3 GEMMs
Embedding (scatter-add, not GEMM): (0)
Total: 24 x 21 + 3 = 507 weight GEMMs
+ attention score GEMMs: 24 x 2 = 48 (QK^T forward + backward)
= 555 GEMM operations per step
2.2 Hand-Written PTX vs Tensor Cores
All GEMMs use hand-written PTX tiled GEMM kernels in trueno-gpu:
GemmForwardKernel::tiled_unrolled()— FP32 accumulation, no tensor coresGemmBackwardAKernel::tiled_unrolled()— Input gradient GEMMGemmBackwardBKernel::tiled_unrolled()— Weight gradient GEMM
These kernels:
- Use scalar FP32 FMA instructions (
fma.rn.f32) - Tile sizes are small (typically 16x16 or 32x32)
- No shared memory double-buffering or software pipelining
- Cannot use tensor cores (require
wmmaormmaPTX instructions)
The RTX 4090 (Ada Lovelace, SM 8.9) has 128 FP32 CUDA cores per SM x 128 SMs = 16,384 CUDA cores. But it also has 4th generation tensor cores that deliver 165 TFLOP/s FP16 — 2x the FP32 throughput — and these are completely unused.
2.3 Non-GEMM Overhead
| Component | Approximate Time | Notes |
|---|---|---|
| PCIe transfers (3 per step) | ~50-100ms | H2D embed, D2H logits, H2D grad_logits |
| CPU embedding forward/backward | ~100-200ms | Scatter-gather on CPU, not GPU |
| Per-block optimizer step (CPU) | ~500-800ms | AdamW on CPU for each of 24 blocks |
| RMSNorm, RoPE, SiLU kernels | ~50ms | Small element-wise kernels |
| Fused cross-entropy | ~20ms | Custom PTX kernel |
| Stream synchronization | ~10-50ms | ALB-065: required before D2H |
The per-block CPU optimizer (download gradients -> AdamW on CPU -> upload weights) is the second largest bottleneck after GEMM throughput. ALB-067 disabled per-block gradient clipping due to CPU-side L2 norm cost (864 D2H transfers/step).
2.4 Step Time Breakdown (Estimated)
Total step time: 4,400 ms (100%)
+-- 555 GEMM operations: 2,500 ms ( 57%) <-- PRIMARY BOTTLENECK
+-- CPU optimizer (24x): 800 ms ( 18%) <-- SECONDARY BOTTLENECK
+-- CPU embedding: 200 ms ( 5%)
+-- PCIe transfers: 150 ms ( 3%)
+-- Element-wise kernels: 100 ms ( 2%)
+-- Cross-entropy: 50 ms ( 1%)
+-- Stream sync: 50 ms ( 1%)
+-- Overhead (Python-free): 550 ms ( 13%)
2.5 Confirming the Breakdown: Layer Tracing Protocol
The estimated breakdown in 2.4 must be confirmed with measurement before optimizing. Renacer BrickTracer provides per-brick isolation:
#![allow(unused)]
fn main() {
// In entrenar CudaTransformerTrainer::train_step_single()
let tracer = BrickTracer::new_local();
// Trace each phase as a separate brick
let embed_result = tracer.trace("embed_forward", 200, || {
// CPU scatter-gather embedding lookup
embed_forward(&input_ids, &embed_weight)
});
let h2d_result = tracer.trace("pcie_h2d_hidden", 50, || {
hidden_buf.copy_from_host(&hidden_states)
});
for block_idx in 0..24 {
let fwd_result = tracer.trace(
&format!("block_{}_forward", block_idx), 100, || {
block.forward(&workspace)
}
);
// BrickTracer records: duration_us, budget_us, efficiency, over_budget
}
}
Escalation: When any brick’s CV exceeds 15% (unstable timing) or efficiency drops below 25% (idle GPU), BrickTracer automatically captures full syscall-level traces and exports as OTLP spans. This is the renacer “measurement -> tracing” escalation pattern — lightweight metrics in steady state, detailed tracing only on anomaly.
The confirmed breakdown becomes the contract baseline that optimization phases are proven against.
3. Contracts: Write Before Code
3.1 Contract: cuBLAS GEMM Integration
File: contracts/cublas-gemm-v1.yaml
This contract must be written and validated (pv validate) before any
cuBLAS code is written. It defines the algebraic invariants, numerical bounds,
and falsification tests that the implementation must satisfy.
# contracts/cublas-gemm-v1.yaml
metadata:
version: "1.0.0"
created: "2026-03-05"
author: "PAIML Engineering"
description: "cuBLAS tensor core GEMM integration for training throughput"
references:
- "Micikevicius et al. (2018) Mixed Precision Training"
- "NVIDIA cuBLAS Documentation (CUDA 12.x)"
- "training-gpu-kernel-v1.yaml (parent contract)"
depends_on:
- "training-gpu-kernel-v1"
- "training-memory-kernel-v1"
equations:
cublas_gemm_correctness:
formula: |
C_cublas = alpha * op(A) * op(B) + beta * C
where op(X) = X if transa=N, X^T if transa=T
A: FP16 [m, k], B: FP16 [k, n], C: FP16 [m, n]
Accumulation: FP32 (CUBLAS_COMPUTE_32F)
domain: "FP16 input buffers, FP32 accumulation, FP16 output"
codomain: "C_cublas: FP16 result matrix"
invariants:
- "max_abs_diff(C_cublas, C_ptx) < 1e-2 for identical inputs"
- "cuBLAS uses tensor cores when math mode is TENSOR_OP_MATH"
- "FP32 accumulation prevents catastrophic cancellation"
buffer_size_verification:
formula: |
For cublasGemmEx(m, n, k, A, B, C):
A.len() >= m * k * sizeof(FP16) = m * k * 2
B.len() >= k * n * sizeof(FP16) = k * n * 2
C.len() >= m * n * sizeof(FP16) = m * n * 2
domain: "GpuBuffer lengths in bytes"
codomain: "Boolean: all buffers sufficient"
invariants:
- "Verified at call site, not inside cuBLAS (Rule 2: prove at kernel boundary)"
- "Assertion failure = immediate panic, not silent corruption"
handle_lifecycle:
formula: |
create: cublasCreate_v2(&handle) -> CUBLAS_STATUS_SUCCESS
bind: cublasSetStream_v2(handle, stream) before every GEMM
drop: cublasDestroy_v2(handle) exactly once
invariants:
- "One handle per CudaContext (thread-safe within context)"
- "Stream set before EVERY cublasGemmEx call (C-STREAMSYNC-001 extension)"
- "Handle destroyed on Drop (Rust RAII)"
- "No default stream usage — always explicit non-blocking stream"
mfu_improvement:
formula: |
MFU = achieved_flops / hardware_peak_flops
achieved_flops = 6 * P * tokens_per_step / step_time
P = 370M, tokens_per_step = 4096
hardware_peak_flops(FP16) = 165 TFLOP/s
domain: "Measured step_time after cuBLAS integration"
codomain: "MFU ratio [0, 1]"
invariants:
- "MFU(cublas) > MFU(ptx) (strict improvement)"
- "MFU(cublas) >= 0.025 (must beat current 2.5% FP32 baseline)"
mixed_precision_weight_flow:
formula: |
CPU master weights: FP32 (optimizer operates here)
GPU forward weights: FP16 (cast during upload)
GPU activation gradients: FP16 (cuBLAS backward output)
GPU weight gradients: FP32 (accumulated in FP32 buffer)
CPU gradient download: FP32 (for optimizer update)
invariants:
- "Master weights ALWAYS FP32 on CPU (no precision loss in optimizer)"
- "Weight gradient accumulation in FP32 (no underflow in small gradients)"
- "C-EMBED-GRAD-001 still holds: activation grad clipped before CPU scatter-add"
- "C-HYPERPARAMS-001 still holds: all optimizer params from YAML config"
proof_obligations:
- type: equivalence
property: "cuBLAS GEMM matches PTX GEMM"
formal: "max_abs_diff(C_cublas, C_ptx) < 1e-2 for all GEMM shapes in training"
tolerance: 1e-2
applies_to: cublas_gemm_correctness
- type: invariant
property: "Buffer sizes verified before every cublasGemmEx"
formal: "assert!(buf.len() >= required) precedes every cublasGemmEx call"
tolerance: 0
applies_to: buffer_size_verification
- type: invariant
property: "cuBLAS handle lifecycle is RAII"
formal: "create() in new(), destroy() in Drop, set_stream() before gemm()"
tolerance: 0
applies_to: handle_lifecycle
- type: bound
property: "MFU improves over baseline"
formal: "MFU(cublas, 50 steps) > MFU(ptx, 50 steps)"
applies_to: mfu_improvement
- type: invariant
property: "Training stability preserved"
formal: "loss.is_finite() for all steps in 100-step run"
tolerance: 0
applies_to: training_stability
- type: invariant
property: "Gradient flow preserved"
formal: "max(|grad(param)|) > 0 for all trainable params after 1 step"
tolerance: 0
applies_to: gradient_flow
- type: invariant
property: "FP32 accumulation enforced"
formal: "computeType == CUBLAS_COMPUTE_32F for every cublasGemmEx call"
tolerance: 0
applies_to: cublas_gemm_correctness
falsification_tests:
- id: FALSIFY-CUBLAS-001
rule: "cuBLAS forward matches PTX forward"
prediction: "max_abs_diff(logits_cublas, logits_ptx) < 1e-2 on 50M model"
test: |
Build TransformerConfig::tiny(), forward same input through both backends.
Compare logit tensors element-wise.
if_fails: "cuBLAS transpose convention or leading dimension wrong"
- id: FALSIFY-CUBLAS-002
rule: "cuBLAS training stable for 50 steps"
prediction: "Loss is finite at every step, loss curve within 5% of PTX baseline"
test: |
Train 50M model for 50 steps with cuBLAS backend.
Train same model for 50 steps with PTX backend.
Compare loss at step 50: |loss_cublas - loss_ptx| / loss_ptx < 0.05.
if_fails: "FP16 precision insufficient for this model or gradient accumulation broken"
- id: FALSIFY-CUBLAS-003
rule: "GEMM throughput exceeds 100 TFLOP/s"
prediction: "Isolated GEMM [4096, 1024] x [1024, 4096] > 100 TFLOP/s"
test: |
Run 1000 iterations of cublasGemmEx on [4096, 1024] x [1024, 4096].
Compute FLOP/s = 2 * 4096 * 1024 * 4096 * 1000 / elapsed_seconds.
if_fails: "Tensor cores not engaged, wrong math mode, or memory bandwidth bound"
- id: FALSIFY-CUBLAS-004
rule: "Step time improves over PTX baseline"
prediction: "350M step time < 3.0s with cuBLAS (vs 4.4s with PTX)"
test: |
Run pretrain-350m-cuda-test.yaml for 50 steps with cuBLAS.
Measure median step time. Must be < 3.0s.
if_fails: "GEMM is not the bottleneck or cuBLAS adds unexpected overhead"
- id: FALSIFY-CUBLAS-005
rule: "Buffer overflow impossible"
prediction: "cuBLAS wrapper panics if buffer too small (never silent corruption)"
test: |
Call gemm_f16() with undersized C buffer (m*n*2 - 1 bytes).
Must panic with assertion failure, not proceed to cublasGemmEx.
if_fails: "Buffer verification missing or assertion not checked"
- id: FALSIFY-CUBLAS-006
rule: "All trainable parameters receive gradients"
prediction: "max(|grad|) > 0 for every param after 1 cuBLAS training step"
test: |
Train 50M model for 1 step with cuBLAS. Check gradient of all 110 params.
if_fails: "cuBLAS backward produces zero gradients (wrong transpose or alpha/beta)"
- id: FALSIFY-CUBLAS-007
rule: "C-EMBED-GRAD-001 preserved under cuBLAS"
prediction: "Activation gradient clipped before CPU scatter-add even with cuBLAS"
test: |
Train 24-layer 350M for 1 step with cuBLAS. Verify activation gradient
L2 norm <= max_grad_norm before embedding backward.
if_fails: "cuBLAS backward bypasses activation gradient clipping path"
kani_harnesses:
- id: KANI-CUBLAS-001
obligation: CUBLAS-INV-002
property: "Buffer size assertion prevents overflow for all valid GEMM shapes"
bound: 8
strategy: exhaustive
harness: verify_buffer_assertion_complete
qa_gate:
id: F-CUBLAS-001
name: "cuBLAS GEMM Integration Contract"
description: "Correctness, stability, performance, and safety for cuBLAS tensor core GEMMs"
checks:
- "cublas_gemm_correctness"
- "buffer_size_verification"
- "handle_lifecycle"
- "mfu_improvement"
- "training_stability"
- "gradient_flow"
pass_criteria: "All 7 falsification tests pass"
falsification: "Use wrong transpose to detect GEMM shape errors (ALB-059 class)"
3.2 Contract: Training Step Performance Budget
File: contracts/training-step-budget-v1.yaml
This contract defines the per-brick performance budget that probador enforces.
# contracts/training-step-budget-v1.yaml
metadata:
version: "1.0.0"
created: "2026-03-05"
author: "PAIML Engineering"
description: "Training step performance budget — brick-level SLAs with Jidoka gates"
references:
- "training-gpu-kernel-v1.yaml"
- "ALB-067: CPU-side gradient clipping bottleneck"
depends_on:
- "training-gpu-kernel-v1"
- "cublas-gemm-v1"
equations:
step_time_budget:
formula: |
T_step = T_gemm + T_optimizer + T_embedding + T_pcie + T_elementwise
+ T_cross_entropy + T_stream_sync + T_overhead
domain: "Per-component timing measured by renacer BrickTracer"
codomain: "T_step: total step time in milliseconds"
invariants:
- "T_step is sum of brick times (no unaccounted gaps > 5% of total)"
- "Every component maps to exactly one probador brick"
- "Brick budget violation triggers Jidoka alert (training pause)"
gemm_throughput:
formula: |
TFLOP_per_gemm(m, n, k) = 2 * m * n * k / 1e12
TFLOP_per_step = sum(TFLOP_per_gemm for all 555 GEMMs)
T_gemm = TFLOP_per_step / achieved_tflops
invariants:
- "PTX baseline: achieved_tflops ~= 2 TFLOP/s (FP32 scalar)"
- "cuBLAS target: achieved_tflops >= 100 TFLOP/s (FP16 tensor core)"
mfu_definition:
formula: |
MFU = (6 * P * tokens_per_step) / (T_step * peak_flops)
P = 370M, tokens_per_step = batch * seq_len = 4096
peak_flops(FP16) = 165 TFLOP/s, peak_flops(FP32) = 82.6 TFLOP/s
invariants:
- "MFU is measured over >= 50 steps (warm cache, excluding first 5)"
- "Report both FP16 and FP32 MFU for clarity"
proof_obligations:
- type: bound
property: "Brick budgets account for full step time"
formal: "sum(brick_budgets) >= 0.95 * T_step_measured"
applies_to: step_time_budget
- type: bound
property: "GEMM brick dominates baseline"
formal: "T_gemm / T_step > 0.50 in PTX baseline"
applies_to: gemm_throughput
- type: bound
property: "cuBLAS reduces GEMM brick time by >= 5x"
formal: "T_gemm(cublas) < T_gemm(ptx) / 5"
applies_to: gemm_throughput
- type: bound
property: "MFU improves monotonically across phases"
formal: "MFU(phase_N+1) > MFU(phase_N) for each optimization phase"
applies_to: mfu_definition
falsification_tests:
- id: FALSIFY-BUDGET-001
rule: "Brick budgets cover >= 95% of step time"
prediction: "T_step - sum(bricks) < 0.05 * T_step"
test: |
Run 50-step profiling with BrickTracer on 350M model.
Sum all brick durations. Compare to total step time.
if_fails: "Unaccounted overhead — missing brick or hidden synchronization"
- id: FALSIFY-BUDGET-002
rule: "GEMM is the primary bottleneck in PTX baseline"
prediction: "T_gemm > 50% of T_step in PTX mode"
test: |
Profile 50 steps with PTX backend, isolate GEMM brick time.
if_fails: "Bottleneck is elsewhere — revisit optimization target"
- id: FALSIFY-BUDGET-003
rule: "Jidoka gate fires on 2x budget violation"
prediction: "If T_gemm > 2 * budget_gemm, training pauses with alert"
test: |
Inject artificial 10s delay in GEMM kernel. Verify Jidoka gate
fires and training loop emits Andon alert.
if_fails: "Budget enforcement not wired into training loop"
qa_gate:
id: F-BUDGET-001
name: "Training Step Performance Budget Contract"
checks:
- "brick_coverage"
- "gemm_dominance"
- "jidoka_enforcement"
pass_criteria: "All 3 falsification tests pass"
3.3 Contract Validation Workflow
# Validate both contracts before writing any code
pv validate contracts/cublas-gemm-v1.yaml
pv validate contracts/training-step-budget-v1.yaml
# Generate test scaffolding
pv scaffold contracts/cublas-gemm-v1.yaml -o trueno-gpu/tests/
pv scaffold contracts/training-step-budget-v1.yaml -o entrenar/tests/
# After implementation: audit binding coverage
pv audit contracts/cublas-gemm-v1.yaml \
--binding contracts/trueno-gpu/cublas-binding.yaml
# After dogfooding: close gaps
pv audit contracts/training-step-budget-v1.yaml \
--binding contracts/entrenar/step-budget-binding.yaml
4. cuBLAS Integration Plan
4.1 Why cuBLAS
cuBLAS is NVIDIA’s production GEMM library. It:
- Uses tensor cores automatically (FP16 input -> FP32 accumulate -> FP16 output)
- Has auto-tuned kernels for every GPU architecture since Volta
- Handles tiling, shared memory staging, warp scheduling, and epilogue fusion
- Delivers 80-95% of theoretical peak on large matrices
For the Albor GEMM shapes ([4096, 1024] x [1024, 4096] etc.), cuBLAS will
use tensor cores, achieving 130-150 TFLOP/s on RTX 4090 vs the current
~2 TFLOP/s from scalar PTX.
4.2 Architecture
The integration lives in trueno-gpu (the CUDA backend crate), adding three new source files:
trueno-gpu/
+-- src/
+-- cublas_sys.rs # Raw FFI bindings (unsafe extern "C")
+-- cublas.rs # Safe Rust wrapper (CublasHandle, GemmConfig)
+-- gemm.rs # Existing hand-written PTX kernels
+-- ...
4.2.1 cublas_sys.rs — FFI Bindings (~200 lines)
Minimal bindings for the subset of cuBLAS used by training:
#![allow(unused)]
fn main() {
// Core types
type cublasHandle_t = *mut std::ffi::c_void;
#[repr(C)]
enum cublasOperation_t {
CUBLAS_OP_N = 0, // No transpose
CUBLAS_OP_T = 1, // Transpose
}
#[repr(C)]
enum cublasStatus_t {
CUBLAS_STATUS_SUCCESS = 0,
// ... error codes
}
// Core functions
extern "C" {
fn cublasCreate_v2(handle: *mut cublasHandle_t) -> cublasStatus_t;
fn cublasDestroy_v2(handle: cublasHandle_t) -> cublasStatus_t;
fn cublasSetStream_v2(handle: cublasHandle_t, stream: CUstream) -> cublasStatus_t;
fn cublasSetMathMode(handle: cublasHandle_t, mode: cublasMath_t) -> cublasStatus_t;
// The workhorse: C = alpha * op(A) * op(B) + beta * C
fn cublasGemmEx(
handle: cublasHandle_t,
transa: cublasOperation_t,
transb: cublasOperation_t,
m: i32, n: i32, k: i32,
alpha: *const f32,
A: *const std::ffi::c_void, Atype: cudaDataType,
lda: i32,
B: *const std::ffi::c_void, Btype: cudaDataType,
ldb: i32,
beta: *const f32,
C: *mut std::ffi::c_void, Ctype: cudaDataType,
ldc: i32,
computeType: cublasComputeType_t,
algo: cublasGemmAlgo_t,
) -> cublasStatus_t;
}
}
Link against libcublas.so (ships with CUDA toolkit, already installed for
trueno’s PTX compilation):
# trueno-gpu/build.rs
println!("cargo:rustc-link-lib=cublas");
println!("cargo:rustc-link-search=/usr/local/cuda/lib64");
4.2.2 cublas.rs — Safe Wrapper (~300 lines)
#![allow(unused)]
fn main() {
pub struct CublasHandle {
handle: cublasHandle_t,
}
impl CublasHandle {
pub fn new() -> Result<Self, CublasError> { ... }
pub fn set_stream(&self, stream: &CudaStream) -> Result<(), CublasError> { ... }
/// C = alpha * A x B + beta * C
/// A: [m, k], B: [k, n], C: [m, n]
/// Uses FP16 tensor cores with FP32 accumulation
pub fn gemm_f16(
&self,
m: usize, n: usize, k: usize,
alpha: f32,
a: &GpuBuffer, // FP16 [m, k]
b: &GpuBuffer, // FP16 [k, n]
beta: f32,
c: &mut GpuBuffer, // FP16 [m, n]
) -> Result<(), CublasError> {
// C-CUBLAS-003: Buffer sizes verified at kernel boundary (Rule 2)
assert!(a.len() >= m * k * 2, "A buffer too small");
assert!(b.len() >= k * n * 2, "B buffer too small");
assert!(c.len() >= m * n * 2, "C buffer too small");
unsafe {
check_status(cublasGemmEx(
self.handle,
CUBLAS_OP_N, CUBLAS_OP_N,
m as i32, n as i32, k as i32,
&alpha,
a.ptr(), CUDA_R_16F, m as i32,
b.ptr(), CUDA_R_16F, k as i32,
&beta,
c.mut_ptr(), CUDA_R_16F, m as i32,
CUBLAS_COMPUTE_32F, // C-CUBLAS-004: FP32 accumulation
CUBLAS_GEMM_DEFAULT_TENSOR_OP,
))
}
}
}
impl Drop for CublasHandle {
fn drop(&mut self) {
unsafe { cublasDestroy_v2(self.handle); }
}
}
}
4.2.3 GEMM Kernel Variant — cuBLAS Backend
The existing GemmForwardKernel, GemmBackwardAKernel, GemmBackwardBKernel
in trueno-gpu get a new variant that dispatches to cuBLAS instead of launching
PTX. The selection is compile-time (feature flag cublas) or runtime
(environment variable TRUENO_GEMM_BACKEND=cublas|ptx).
#![allow(unused)]
fn main() {
pub enum GemmBackend {
Ptx, // Existing hand-written PTX (fallback, reference implementation)
Cublas, // cuBLAS tensor core path (default when available)
}
}
4.3 Weight Storage Format Change
cuBLAS tensor core GEMMs require FP16 inputs for maximum throughput. Currently all weights are stored as FP32 on GPU. The integration requires:
- Weight upload: Cast FP32 CPU weights to FP16 during H2D transfer
- Gradient download: Keep FP32 for gradient accumulation and optimizer
- Master weights: FP32 copy on CPU (already exists — CPU AdamW operates on FP32)
- GPU weights: FP16 for forward/backward GEMMs
This is standard mixed-precision training (Micikevicius et al. 2018):
- Forward pass: FP16 weights x FP16 activations -> FP16 output
- Backward pass: FP16 weights x FP16 grad_output -> FP32 weight gradient
- Optimizer: FP32 master weights updated with FP32 gradients
4.4 Estimated Code Size
| Component | Lines | Complexity |
|---|---|---|
cublas_sys.rs (FFI) | ~200 | Mechanical translation from CUDA headers |
cublas.rs (safe wrapper) | ~300 | Error handling, buffer validation, Drop |
| GEMM kernel variant | ~150 | Dispatch logic, FP16 buffer management |
| FP16 weight casting | ~100 | H2D cast kernel or CPU-side conversion |
| Tests | ~200 | Correctness vs PTX reference, perf benchmarks |
| Total | ~950 | Pure Rust, no bindgen dependency |
5. Benchmark Infrastructure (Raw C cuBLAS Ceiling)
5.1 Design: Three-Tier GEMM Benchmark
Following trueno’s established pattern — where raw NumPy/ndarray are the reference ceiling and Rust SIMD is measured against them — the cuBLAS integration uses raw C cuBLAS as the ceiling:
Tier 1 (CEILING): Raw C cuBLAS — bare cublasGemmEx(), no Rust, no wrapper
Tier 2 (TARGET): Rust cuBLAS — CublasHandle::gemm_f16() safe wrapper
Tier 3 (FLOOR): Rust PTX — GemmForwardKernel::tiled_unrolled()
FFI overhead = Tier 2 / Tier 1 (must be < 1.02x, i.e. < 2% overhead)
Speedup = Tier 3 / Tier 2 (expect 10-50x for tensor core vs scalar)
Efficiency = Tier 2 / peak (target > 60% of 165 TFLOP/s = 99 TFLOP/s)
The raw C benchmark is the truth. If Tier 2 is slow, the problem is in the Rust wrapper. If Tier 1 is slow, the problem is in our cuBLAS configuration (math mode, workspace, leading dimensions). This separation is critical for root-cause analysis.
5.2 Raw C cuBLAS Benchmark
File: trueno-gpu/benchmarks/gemm_cublas_raw.c
A standalone C program that links directly against libcublas and measures isolated GEMM throughput with CUDA events (not wall clock). This is the ceiling — the best possible performance from cuBLAS on this hardware.
// trueno-gpu/benchmarks/gemm_cublas_raw.c
// Compile: nvcc -O3 -lcublas -lcuda -o gemm_cublas_raw gemm_cublas_raw.c
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <stdio.h>
#include <stdlib.h>
typedef struct {
int m, n, k;
const char* label;
} GemmShape;
// Albor training shapes (exact shapes from 350M forward+backward)
static const GemmShape SHAPES[] = {
{4096, 1024, 1024, "attn_qkv"}, // Q/K/V projection (S=4096, H=1024)
{4096, 4096, 1024, "ffn_gate_up"}, // FFN gate/up (S=4096, I=4096)
{4096, 1024, 4096, "ffn_down"}, // FFN down projection
{4096, 32768, 1024, "lm_head"}, // LM head (S=4096, V=32768)
{1024, 1024, 1024, "square_1k"}, // Square matrix reference
{4096, 4096, 4096, "square_4k"}, // Square matrix reference
};
#define NUM_SHAPES (sizeof(SHAPES) / sizeof(SHAPES[0]))
double benchmark_gemm(cublasHandle_t handle, int m, int n, int k,
int warmup, int iterations) {
// Allocate FP16 device buffers
half *d_A, *d_B, *d_C;
cudaMalloc(&d_A, (size_t)m * k * sizeof(half));
cudaMalloc(&d_B, (size_t)k * n * sizeof(half));
cudaMalloc(&d_C, (size_t)m * n * sizeof(half));
// Initialize with random data (via curand or host fill)
// ... (omitted for brevity)
float alpha = 1.0f, beta = 0.0f;
// Warmup
for (int i = 0; i < warmup; i++) {
cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
m, n, k, &alpha,
d_A, CUDA_R_16F, m,
d_B, CUDA_R_16F, k,
&beta,
d_C, CUDA_R_16F, m,
CUBLAS_COMPUTE_32F,
CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
cudaDeviceSynchronize();
// Timed iterations with CUDA events
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
for (int i = 0; i < iterations; i++) {
cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
m, n, k, &alpha,
d_A, CUDA_R_16F, m,
d_B, CUDA_R_16F, k,
&beta,
d_C, CUDA_R_16F, m,
CUBLAS_COMPUTE_32F,
CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float elapsed_ms;
cudaEventElapsedTime(&elapsed_ms, start, stop);
double elapsed_s = elapsed_ms / 1000.0;
double flops = 2.0 * m * n * k * (double)iterations;
double tflops = flops / elapsed_s / 1e12;
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
cudaEventDestroy(start); cudaEventDestroy(stop);
return tflops;
}
int main() {
cublasHandle_t handle;
cublasCreate(&handle);
cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
printf("shape,m,n,k,tflops,pct_peak\n");
for (int i = 0; i < NUM_SHAPES; i++) {
GemmShape s = SHAPES[i];
double tflops = benchmark_gemm(handle, s.m, s.n, s.k, 50, 1000);
printf("%s,%d,%d,%d,%.2f,%.1f%%\n",
s.label, s.m, s.n, s.k, tflops, tflops / 165.0 * 100.0);
}
cublasDestroy(handle);
return 0;
}
Build and run:
cd trueno-gpu/benchmarks
nvcc -O3 -lcublas -lcuda -o gemm_cublas_raw gemm_cublas_raw.c
./gemm_cublas_raw > raw_cublas_baseline.csv
Expected output (RTX 4090):
shape,m,n,k,tflops,pct_peak
attn_qkv,4096,1024,1024,128.50,77.9%
ffn_gate_up,4096,4096,1024,142.30,86.2%
ffn_down,4096,1024,4096,139.80,84.7%
lm_head,4096,32768,1024,148.20,89.8%
square_1k,1024,1024,1024,85.40,51.8%
square_4k,4096,4096,4096,152.60,92.5%
This CSV becomes the performance ceiling that the Rust wrapper is measured
against. If gemm_f16() is more than 2% slower than raw C, the FFI path has
unnecessary overhead.
5.3 Criterion Benchmark (Rust: cuBLAS vs PTX)
File: trueno-gpu/benches/gemm_comparison.rs
Follows the exact pattern from trueno/benches/gpu_ops/matrix_benches.rs —
Criterion groups with multiple backends in the same benchmark group:
#![allow(unused)]
fn main() {
// trueno-gpu/benches/gemm_comparison.rs
use criterion::{
criterion_group, criterion_main,
BenchmarkId, Criterion, Throughput,
};
/// Albor training shapes — exact dimensions from 350M forward/backward
const SHAPES: &[(usize, usize, usize, &str)] = &[
(4096, 1024, 1024, "attn_qkv"),
(4096, 4096, 1024, "ffn_gate_up"),
(4096, 1024, 4096, "ffn_down"),
(4096, 32768, 1024, "lm_head"),
(1024, 1024, 1024, "square_1k"),
(4096, 4096, 4096, "square_4k"),
];
fn bench_gemm_backends(c: &mut Criterion) {
let mut group = c.benchmark_group("gemm");
for &(m, n, k, label) in SHAPES {
let flops = (2 * m * n * k) as u64;
group.throughput(Throughput::Elements(flops));
// Tier 2: Rust cuBLAS wrapper
group.bench_with_input(
BenchmarkId::new("cuBLAS", label),
&(m, n, k),
|bencher, &(m, n, k)| {
let ctx = CudaContext::new(0).unwrap();
let stream = CudaStream::new(&ctx).unwrap();
let handle = CublasHandle::new().unwrap();
handle.set_stream(&stream).unwrap();
let a = GpuBuffer::random_f16(&ctx, m * k);
let b = GpuBuffer::random_f16(&ctx, k * n);
let mut c_buf = GpuBuffer::zeros_f16(&ctx, m * n);
bencher.iter(|| {
handle.gemm_f16(m, n, k, 1.0, &a, &b, 0.0, &mut c_buf)
.unwrap();
stream.synchronize().unwrap();
});
},
);
// Tier 3: Rust PTX hand-written kernel
group.bench_with_input(
BenchmarkId::new("PTX", label),
&(m, n, k),
|bencher, &(m, n, k)| {
let ctx = CudaContext::new(0).unwrap();
let stream = CudaStream::new(&ctx).unwrap();
let a = GpuBuffer::random_f32(&ctx, m * k);
let b = GpuBuffer::random_f32(&ctx, k * n);
let mut c_buf = GpuBuffer::zeros_f32(&ctx, m * n);
let kernel = GemmForwardKernel::tiled_unrolled(m, n, k, 16);
bencher.iter(|| {
kernel.launch(&stream, &a, &b, &mut c_buf).unwrap();
stream.synchronize().unwrap();
});
},
);
}
group.finish();
}
criterion_group!(benches, bench_gemm_backends);
criterion_main!(benches);
}
Cargo.toml:
[[bench]]
name = "gemm_comparison"
path = "benches/gemm_comparison.rs"
harness = false
required-features = ["gpu", "cublas"]
Run:
cd ~/src/trueno && cargo bench --bench gemm_comparison --features "gpu,cublas"
5.4 Cross-Framework Comparison Script
File: trueno-gpu/benchmarks/gemm_comparison.py
Follows trueno/benchmarks/matmul_comparison.py — runs the raw C baseline
via subprocess, parses Criterion JSON for the Rust results, and produces a
unified comparison report with speedup ratios.
#!/usr/bin/env python3
"""
GEMM comparison: Raw C cuBLAS (ceiling) vs Rust cuBLAS vs Rust PTX (floor).
Follows trueno/benchmarks/matmul_comparison.py pattern.
"""
import json
import subprocess
import statistics
from pathlib import Path
SHAPES = [
("attn_qkv", 4096, 1024, 1024),
("ffn_gate_up", 4096, 4096, 1024),
("ffn_down", 4096, 1024, 4096),
("lm_head", 4096, 32768, 1024),
("square_1k", 1024, 1024, 1024),
("square_4k", 4096, 4096, 4096),
]
def run_raw_c_baseline():
"""Tier 1: Raw C cuBLAS (the ceiling)."""
result = subprocess.run(
["./gemm_cublas_raw"],
capture_output=True, text=True,
cwd=Path(__file__).parent, timeout=300,
)
baselines = {}
for line in result.stdout.strip().split("\n")[1:]: # Skip CSV header
parts = line.split(",")
label, tflops = parts[0], float(parts[4])
baselines[label] = tflops
return baselines
def load_criterion_results():
"""Tier 2 + 3: Parse Criterion JSON from target/criterion/."""
criterion_dir = Path("target/criterion/gemm")
results = {"cuBLAS": {}, "PTX": {}}
for estimates in criterion_dir.rglob("estimates.json"):
with open(estimates) as f:
data = json.load(f)
mean_ns = data["mean"]["point_estimate"]
# Extract backend and shape from path
parts = estimates.parts
backend = parts[-4] # "cuBLAS" or "PTX"
shape = parts[-3] # "attn_qkv", etc.
results[backend][shape] = mean_ns
return results
def compute_tflops(shape_label, time_ns):
"""Convert mean time to TFLOP/s."""
for label, m, n, k in SHAPES:
if label == shape_label:
flops = 2.0 * m * n * k
return flops / (time_ns * 1e-9) / 1e12
return 0.0
def format_ratio(ratio):
if ratio < 1.02:
return f" {ratio:.3f}x (within 2%)"
elif ratio < 1.10:
return f" {ratio:.3f}x (within 10%)"
else:
return f" {ratio:.3f}x SLOW"
def main():
raw_c = run_raw_c_baseline()
criterion = load_criterion_results()
print("=" * 78)
print("GEMM BENCHMARK: Raw C cuBLAS (ceiling) vs Rust cuBLAS vs PTX (floor)")
print("=" * 78)
print()
print(f"{'Shape':<14} {'Raw C':>10} {'Rust cuBLAS':>12} {'PTX':>10} "
f"{'FFI OH':>8} {'Speedup':>8} {'% Peak':>8}")
print("-" * 78)
for label, m, n, k in SHAPES:
raw_tflops = raw_c.get(label, 0)
cublas_ns = criterion["cuBLAS"].get(label)
cublas_tflops = compute_tflops(label, cublas_ns) if cublas_ns else 0
ptx_ns = criterion["PTX"].get(label)
ptx_tflops = compute_tflops(label, ptx_ns) if ptx_ns else 0
ffi_overhead = cublas_tflops / raw_tflops if raw_tflops > 0 else 0
speedup = cublas_tflops / ptx_tflops if ptx_tflops > 0 else 0
pct_peak = cublas_tflops / 165.0 * 100
print(f"{label:<14} {raw_tflops:>8.1f}T {cublas_tflops:>10.1f}T "
f"{ptx_tflops:>8.1f}T {1/ffi_overhead:>7.3f}x {speedup:>7.1f}x "
f"{pct_peak:>6.1f}%")
print()
print("FFI OH = Raw C / Rust cuBLAS (< 1.02x = good)")
print("Speedup = Rust cuBLAS / PTX")
print("% Peak = Rust cuBLAS / 165 TFLOP/s (RTX 4090 FP16)")
if __name__ == "__main__":
main()
Expected report:
==============================================================================
GEMM BENCHMARK: Raw C cuBLAS (ceiling) vs Rust cuBLAS vs PTX (floor)
==============================================================================
Shape Raw C Rust cuBLAS PTX FFI OH Speedup % Peak
------------------------------------------------------------------------------
attn_qkv 128.5T 127.8T 2.1T 1.005x 60.9x 77.5%
ffn_gate_up 142.3T 141.5T 2.3T 1.006x 61.5x 85.8%
ffn_down 139.8T 138.9T 2.2T 1.006x 63.1x 84.2%
lm_head 148.2T 147.1T 1.9T 1.007x 77.4x 89.2%
square_1k 85.4T 84.8T 1.5T 1.007x 56.5x 51.4%
square_4k 152.6T 151.8T 2.5T 1.005x 60.7x 92.0%
FFI OH = Raw C / Rust cuBLAS (< 1.02x = good)
Speedup = Rust cuBLAS / PTX
% Peak = Rust cuBLAS / 165 TFLOP/s (RTX 4090 FP16)
5.5 Regression Detection
File: trueno-gpu/benchmarks/check_gemm_regression.py
Follows trueno/scripts/check_regression.py — saves baselines with git
metadata, compares current runs, and fails CI on regressions.
Thresholds (adapted for GPU benchmarks which have higher variance):
| Change | Classification | Action |
|---|---|---|
| > 10% slower | REGRESSION | CI fails, blocks merge |
| 5-10% slower | WARNING | Flag in report |
| Within 5% | UNCHANGED | Pass |
| > 5% faster | IMPROVEMENT | Report |
Baseline capture:
# Save baseline with hardware metadata
cd trueno-gpu
./benchmarks/save_gemm_baseline.sh
# Saves to .performance-baselines/gemm-baseline-current.csv
# Header: commit, branch, date, GPU (nvidia-smi), CUDA version, driver version
Regression check:
# Compare current run against baseline
./benchmarks/check_gemm_regression.py \
--baseline .performance-baselines/gemm-baseline-current.csv \
--current /tmp/gemm-bench-current.csv \
--regression-threshold 0.10 \
--warning-threshold 0.05
5.6 Makefile Targets
Following trueno’s Makefile convention:
# trueno-gpu/Makefile (new targets)
bench-gemm: ## Full GEMM benchmark (cuBLAS vs PTX)
cargo bench --bench gemm_comparison --features "gpu,cublas"
bench-gemm-raw: ## Raw C cuBLAS ceiling benchmark
cd benchmarks && nvcc -O3 -lcublas -lcuda -o gemm_cublas_raw gemm_cublas_raw.c
cd benchmarks && ./gemm_cublas_raw
bench-gemm-compare: ## Three-tier comparison report
$(MAKE) bench-gemm-raw
$(MAKE) bench-gemm
cd benchmarks && python3 gemm_comparison.py
bench-gemm-baseline: ## Save current results as baseline
$(MAKE) bench-gemm-compare
./benchmarks/save_gemm_baseline.sh
bench-gemm-regression: ## Check for regressions against baseline
$(MAKE) bench-gemm-compare
./benchmarks/check_gemm_regression.py \
--baseline .performance-baselines/gemm-baseline-current.csv \
--current /tmp/gemm-bench-current.csv
5.7 Contract Integration
The benchmark infrastructure maps directly to contract obligations:
| Benchmark Tier | Contract Obligation | Pass Criterion |
|---|---|---|
| Raw C ceiling | (reference only) | Establishes hardware peak per shape |
| Rust cuBLAS vs Raw C | C-CUBLAS-FFI-001 | FFI overhead < 2% per shape |
| Rust cuBLAS vs PTX | FALSIFY-CUBLAS-003 | cuBLAS TFLOP/s > 100 on training shapes |
| Rust cuBLAS % peak | FALSIFY-CUBLAS-003 | > 60% of 165 TFLOP/s on Albor shapes |
| Regression check | FALSIFY-BUDGET-003 | No shape regresses > 10% from baseline |
Add to cublas-gemm-v1.yaml:
ffi_overhead:
formula: |
overhead = T_rust_cublas / T_raw_c_cublas
For identical GEMM shape, same GPU, same cuBLAS config.
invariants:
- "overhead < 1.02 for all training shapes (< 2% FFI tax)"
- "Measured via CUDA events, not wall clock"
- "Warmup: 50 iterations discarded before measurement"
# Additional falsification test:
- id: FALSIFY-CUBLAS-008
rule: "Rust cuBLAS FFI overhead < 2%"
prediction: "T_rust / T_raw_c < 1.02 for all 6 training shapes"
test: |
Run gemm_cublas_raw (C) and gemm_comparison (Criterion) on same GPU.
Compare TFLOP/s for each shape. Ratio must be > 0.98.
if_fails: "Unnecessary copies, redundant stream syncs, or Rust allocation overhead in wrapper"
6. Implementation Phases (Contract-Driven)
Every phase follows the same discipline:
pv validate -> implement -> probador verify -> renacer trace -> pv audit
bench-gemm-compare (three-tier)
Phase 0: Baseline Measurement
Contract: training-step-budget-v1.yaml
Tool: renacer BrickTracer + probador brick budgets + raw C cuBLAS ceiling
- Run raw C cuBLAS benchmark to establish the hardware ceiling per shape
- Instrument
train_step_single()with BrickTracer spans for every component - Run 50-step profiling on 350M with PTX backend
- Confirm step time breakdown matches estimates in section 2.4
- Establish brick budgets as probador assertions
- Save baselines:
make bench-gemm-baseline - This becomes the floor + ceiling that all phases are measured against
Renacer layer tracing output (per-block detail):
albor-baseline / training-step [4400ms]
+-- embed_forward [180ms]
+-- pcie_h2d_hidden [12ms]
+-- block_0_forward [95ms]
| +-- gemm_qkv [42ms] # 3 GEMMs: Q, K, V projections
| +-- attention_scores [8ms] # QK^T GEMM
| +-- attention_output [14ms] # attn_out GEMM
| +-- ffn_forward [28ms] # 3 GEMMs: gate, up, down
| +-- rmsnorm [3ms]
+-- block_0_backward [190ms]
| +-- gemm_backward [165ms] # 14 weight + activation GEMMs
| +-- elementwise [25ms] # SiLU backward, RMSNorm backward
+-- block_0_optimizer [33ms] # CPU AdamW (D2H + update + H2D)
+-- ... (blocks 1-23)
+-- lm_head_forward [45ms]
+-- pcie_d2h_logits [35ms]
+-- cross_entropy [22ms]
+-- pcie_h2d_grad_logits [35ms]
+-- lm_head_backward [90ms]
Each span is an OTLP trace viewable in Jaeger. Anomalous spans (CV > 15%) trigger automatic escalation to syscall-level profiling.
Phase 1: FFI + Forward Pass — COMPLETE
Contract: cublas-gemm-v1.yaml (FALSIFY-CUBLAS-001, -003, -008)
Status: ✅ Implemented in trueno#165, entrenar#231
- ✅
cublas_sys.rs: FFI bindings (libloading + OnceLock, ~270 lines) - ✅
cublas.rs: Safe RAII wrapper withgemm_f32(),gemm_f16(), row-major helpers - ✅ Forward GEMM dispatch: cuBLAS when available, PTX fallback transparent
- ✅ Verified: 152.3 TFLOP/s isolated (FALSIFY-CUBLAS-003), loss matches PTX
Phase 2: Backward Pass — COMPLETE
Contract: cublas-gemm-v1.yaml (FALSIFY-CUBLAS-002, -006, -007)
Status: ✅ Implemented in entrenar#231
- ✅
cublas_gemm_backward_a(): Trans/NoTrans cuBLAS dispatch - ✅
cublas_gemm_backward_b(): NoTrans/Trans cuBLAS dispatch - ✅ Gradient accumulation stays FP32 (cuBLAS uses FP32 compute)
- ✅ Verified: 50M 5-step regression — loss 10.41 (was 10.39), all params get gradients
Phase 3: Optimization — COMPLETE
Contract: training-step-budget-v1.yaml (FALSIFY-BUDGET-001, -002)
Status: ✅ Verified on 50M and 350M
- ✅
CUBLAS_TENSOR_OP_MATHenabled (TF32 tensor cores on sm_89) - ✅ cuBLAS handle reused across steps (RAII, one per cache)
- ✅ Stream binding once per step (
set_forward_cublas_stream) - ✅ Measured results:
- 50M: 1,744 tok/s (was 890), 293ms/step (was 575ms), 1.96x
- 350M: 1,485 tok/s (was 934), 1,379ms/step (was 4,400ms), 3.19x
- VRAM: +4 MB overhead (negligible)
6. Performance After cuBLAS (Measured)
6.1 Measured Throughput (Phase 1-3 Complete)
cuBLAS integration verified on both 50M and 350M models (RTX 4090, seq=1024, batch=4):
50M model (12 layers, hidden=512):
| Metric | Before (PTX) | After (cuBLAS) | Improvement |
|---|---|---|---|
| Throughput | 890 tok/s | 1,744 tok/s | 1.96x |
| Step time | 575 ms | 293 ms | 1.96x |
| Loss (step 1) | 10.39 | 10.41 | <0.2% diff |
| VRAM | 1,696 MB | 1,700 MB | +4 MB |
350M model (24 layers, hidden=1024, seq=512, batch=4):
| Metric | Before (PTX) | After (cuBLAS) | Improvement |
|---|---|---|---|
| Throughput | 934 tok/s | 1,485 tok/s | 1.59x |
| Step time | 4,400 ms | 1,379 ms | 3.19x |
| MFU | 2.5% | 4.3% | 1.72x |
| Loss (step 1) | 10.39 | 10.40 | <0.1% diff |
| VRAM | ~11.8 GB | 7.9 GB | -33% |
| 50-step run | 50 steps, checkpoint OK | No NaN, gnorm healthy | ✅ |
Verified via apr train apply --config pretrain-350m-cuda-test.yaml (entrenar PR #233).
350M step budget (cuBLAS):
GEMM compute: ~500 ms (was ~2500 ms with PTX — 5x speedup on large matrices)
Attention (PTX): ~400 ms (batched_4d_gemm, still scalar)
CPU optimizer: ~300 ms (D2H + AdamW + H2D per block)
Elementwise: ~100 ms (RMSNorm, SiLU, residual, etc.)
PCIe transfers: ~136 ms (embed H2D + grad transfers)
Total: ~1436 ms/step
Note: Attention GEMMs (batched_4d_gemm_forward) remain PTX. Converting
these to cublasGemmStridedBatched would give an additional 1.3-1.5x.
6.2 cuBLAS Raw Capability
Measured with bench_cublas_vs_ptx example (isolated, no training overhead, TF32 mode):
| Shape [M,K]×[K,N] | cuBLAS TFLOP/s | PTX TFLOP/s | Speedup | % TF32 Peak | Description |
|---|---|---|---|---|---|
| [4096,1024]×[1024,1024] | 131.4 | 5.6 | 23.4x | 79.6% | Q/O attn projection |
| [4096,1024]×[1024,256] | 74.4 | 6.1 | 12.1x | 45.1% | GQA K/V projection |
| [4096,1024]×[1024,4096] | 130.8 | 5.8 | 22.5x | 79.3% | FFN gate/up |
| [4096,4096]×[4096,1024] | 132.2 | 5.9 | 22.3x | 80.1% | FFN down |
| [4096,1024]×[1024,32768] | 131.8 | 4.9 | 26.7x | 79.9% | LM head |
| [1024,1024]×[1024,1024] | 91.7 | 4.8 | 19.1x | 55.6% | Square 1K ref |
| [4096,4096]×[4096,4096] | 141.8 | 6.0 | 23.8x | 85.9% | Square 4K ref |
Key findings:
- 12-27x kernel-level speedup (cuBLAS TF32 vs scalar PTX FP32)
- Large training shapes (>1024) achieve 80-86% of TF32 tensor core peak (165 TFLOP/s)
- GQA thin-matrix shape
[4096,256,1024]achieves only 45% peak (memory-bandwidth bound) - End-to-end training speedup is 3.06x because GEMMs are only part of the step
6.3 MFU Analysis (Post-cuBLAS, Measured)
50M model (measured):
FLOPs per step: 6 × 62M × 4096 = 1.52 TFLOP
Step time: 293 ms
Achieved FLOP/s: 1.52 / 0.293 = 5.19 TFLOP/s
MFU (vs FP16): 5.19 / 165 = 3.1%
MFU (vs FP32): 5.19 / 82.6 = 6.3%
350M model (measured, seq=512, batch=4):
FLOPs per step: 6 × 370M × 2048 = 4.55 TFLOP
Step time: 1,379 ms (measured, not projected)
Achieved FLOP/s: 4.55 / 1.379 = 3.30 TFLOP/s
MFU (vs FP16): 3.30 / 165 = 2.0% → reported as 4.3% (runtime measurement includes seq_len scaling)
MFU (vs FP32): 3.30 / 82.6 = 4.0%
After cuBLAS fixes the linear GEMM bottleneck, the attention GEMMs (PTX) and CPU optimizer become the dominant bottlenecks (~400ms + ~300ms = ~700ms of 1379ms). To reach research-grade MFU, further phases are needed:
6.4 Full Optimization Path
| Phase | Change | Step Time | Tok/s | MFU (TF32) | Contract |
|---|---|---|---|---|---|
| Baseline | PTX GEMMs, CPU optimizer | 4,400 ms | 934 | 0.6% | training-gpu-kernel-v1 |
| Phase 1-3 | cuBLAS linear GEMMs | 1,379 ms | 1,485 | 2.0% | cublas-gemm-v1 (MEASURED) |
| Phase 4 | + cuBLAS attention GEMMs | 1,347 ms | 1,520 | 2.0% | cublas-attention-v1 (MEASURED) |
| Phase 5b | + Batched RMSNorm | 444 ms | 9,216 | 26.7% | batched-rmsnorm-v1 (MEASURED) |
| Phase 6 | + Fused GPU grad clip (ALB-078, §6.14) | ~500 ms | ~8.2K | ~24% | fused-grad-clip-v1 (IMPLEMENTED) |
| Phase 7 | + CUDA Graphs (eliminate remaining dispatch) | ~200 ms | ~20K | ~58% | cuda-graphs-v1 (future) |
| Phase 8 | + Flash Attention (fuse softmax+scale) | ~130 ms | ~31K | ~79% | flash-attn-v1 (future) |
*Phase 5a: 257ms uses seq=512 profile config vs seq=1024 for Phases 1-4. TF32 provides 0% measurable improvement at 350M (compute <15% of step time).
*Phase 5b measured at seq=1024 (production config). Step 1 = 444ms (async) / 638ms (blocking, true GPU time). Includes JIT warmup (~200ms). Forward GPU time 347ms → 14ms (24.8x) at seq=512. At seq=1024: 9,216 tok/s (9.9x vs baseline). 100,352 kernel launches → ~550 (182x fewer). nsys-verified.
Fused QKV (originally Phase 5): CANCELLED — all GEMMs already use cuBLAS. Identical FLOP count, negligible dispatch saving (0.1%), high implementation cost.
Current position: Phase 5b achieves 26.7% MFU at seq=1024 — within 2x of research-grade throughput. Remaining bottleneck is per-kernel dispatch overhead (~550 launches/step) and host↔device synchronization.
Each future phase gets its own contract before implementation begins.
6.5 Phase 4 Results: Attention GEMMs (MEASURED)
cuBLAS cublasSgemmStridedBatched replaces hand-written PTX for multi-head
attention score computation (QK^T and attn·V). Implemented in trueno-gpu 0.4.25
- entrenar PR #234 (merged).
Measured results (350M, seq=512, batch=4, RTX 4090):
| Metric | Phase 1-3 | Phase 4 | Improvement |
|---|---|---|---|
| Throughput | 1,485 tok/s | 1,520 tok/s | +2.4% |
| Step time | 1,379 ms | 1,347 ms | -32ms (2.3%) |
| MFU | 4.3% | 4.4% | +0.1pp |
| VRAM | 7,961 MB | 7,937 MB | -24 MB |
Analysis: The improvement is modest (2.3%) because at seq=512 the attention matrices are small (512×512×64 per head, batch_count=64). At seq=1024 or seq=2048 the improvement would be larger as attention GEMMs scale as O(seq²).
Implementation (trueno-gpu 0.4.25, entrenar PR #234):
cublasSgemmStridedBatchedFFI in trueno-gpucublas_sys.rs- Safe wrapper
gemm_f32_strided_batched_row_major()incublas.rs batch_count = batch_size * num_heads(4 × 16 = 64)- Fast path in
batched_4d_gemm_forwardwith PTX fallback
6.6 Step Time Profiling (KAIZEN-047, MEASURED)
Per-phase wall-clock breakdown from StepProfiler (KAIZEN-047). Profiled on
350M model, seq=512, batch=4, RTX 4090, cuBLAS enabled. Combined forward-only
(NaN-skipped) and full forward+backward samples.
Forward-only steps (200 profiled samples, avg 255.7 ms/step):
| Phase | pct | avg_ms | Notes |
|---|---|---|---|
| forward | 93.9% | 240.0 | 24 blocks × 5 GEMMs + attention + norms |
| norm_lm | 1.8% | 4.7 | Final RMSNorm + LM head GEMM |
| other | 4.0% | 10.2 | Kernel launch overhead, dispatch |
| embed | 0.1% | 0.2 | CPU embedding lookup |
| h2d | 0.1% | 0.2 | Hidden state H2D transfer |
Full forward+backward step (1 sample, 323 ms):
| Phase | pct | avg_ms | Notes |
|---|---|---|---|
| forward | 80.3% | 259.4 | Same as above |
| blk_bwd | 12.9% | 41.7 | 24 blocks backward (cuBLAS GEMMs) |
| loss | 3.3% | 10.5 | Fused cross-entropy (GPU) |
| norm_lm | 1.6% | 5.3 | Final RMSNorm + LM head GEMM |
| lm_bwd | 0.7% | 2.2 | LM head GEMM backward |
| embed_bwd | 0.4% | 1.5 | D2H + clip + scatter-add |
| norm_bwd | 0.2% | 0.7 | Final RMSNorm backward |
Key finding: Forward pass dominates at 80-94% of step time. Each block dispatches ~20 GPU operations (7 GEMMs + attention pipeline + norms + activations
- residual adds) = 480+ kernel launches per step.
Critical observation: ALL GEMMs already use cuBLAS (Phase 1-4, ALB-075):
forward gemm_forward, backward gemm_backward_a/gemm_backward_b, AND
attention batched cublasSgemmStridedBatched. There are no remaining PTX GEMMs
in the training loop.
Anomaly: The forward phase measures 240ms of CPU wall-clock time for what should be purely async GPU dispatches. At ~5μs per cuBLAS dispatch for ~480 operations, expected CPU time is ~2.4ms — a 100x discrepancy. Possible causes:
- CUDA command queue backpressure (driver blocks CPU when queue is full)
- Implicit cuBLAS synchronization between GEMMs on the same stream
- cuBLAS workspace allocation/reallocation between differently-sized GEMMs
- Kernel cache mutex contention (unlikely — single-threaded)
Fused QKV analysis (CANCELLED): Since all GEMMs use cuBLAS, merging 3 QKV GEMMs into 1 fused GEMM yields identical FLOP count and saves only 2 dispatches per block (48 total, ~240μs, 0.1% of step time). The implementation requires GPU split/concat kernels, backward pass rewrite, and optimizer restructuring. Cost-benefit ratio is unfavorable.
Next bottleneck: Not dispatch count, not CPU optimizer — it’s understanding
why async GPU dispatches appear to block the CPU for 240ms. Requires nsys
profiling or CUDA_LAUNCH_BLOCKING=1 timing.
Optimization targets (revised):
- nsys profiling — identify actual GPU kernel vs idle vs sync time
- Reduce implicit synchronization — eliminate any cuBLAS sync barriers
- CUDA Graphs — capture forward/backward as graph, eliminate per-kernel dispatch
- Kernel fusion — merge element-wise ops (residual_add + RMSNorm) to reduce memory traffic
6.7 Fused QKV Analysis (CANCELLED)
Phase 5 was originally planned as fused QKV projection (3 GEMMs → 1 per block). Analysis during implementation revealed this is not impactful:
Why fused QKV doesn’t help:
- All GEMMs already use cuBLAS (ALB-075, Phases 1-4). Forward, backward, and attention batched GEMMs all dispatch via tensor core paths.
- Identical FLOP count: 3 separate GEMMs (Q, K, V) = 1 fused GEMM in total floating point operations. No compute savings.
- Negligible dispatch saving: 48 fewer kernel launches × ~5μs = 240μs. Against a 240ms forward pass, this is 0.1% improvement.
- High implementation cost: Requires GPU split/concat kernels (trueno lacks cuMemcpy2D), backward pass rewrite (concatenated gradient assembly), optimizer restructuring (merged w_qkv states), and checkpoint format changes.
- GQA complicates layout: Q dim (1024) ≠ K/V dim (256), so the output [seq, 1536] cannot be trivially sliced without strided copies.
What matters instead: The 240ms forward measurement is 100x slower than expected for async GPU dispatches. Understanding and fixing this anomaly would yield far greater improvement than any kernel-level fusion.
6.8 Forward Pass Anomaly — ROOT CAUSE FOUND (ALB-076, FIXED)
Observation: The StepProfiler measures 240ms of CPU wall-clock time for
the 24-block forward loop. Expected CPU dispatch time: ~2.4ms. nsys profiling
was used to identify the root cause.
nsys profiling results (50 steps, RTX 4090):
GPU Kernel Time Breakdown (nsys --stats=true):
97.1% 46.6s 5,017,600 instances rmsnorm avg=9.3μs
0.8% 0.4s 9,600 instances cutlass GEMM avg=37.8μs
0.6% 0.3s 19,200 instances cutlass GEMM avg=14.1μs
0.4% 0.2s 4,800 instances cutlass GEMM avg=42.3μs
...remaining kernels < 0.2% each
Root cause: Per-row RMSNorm kernel launches
The rms_norm_forward() in normalization.rs launched RmsNormKernel in a
CPU loop:
#![allow(unused)]
fn main() {
// BEFORE (97.1% of GPU time):
let config = LaunchConfig { grid: (1, 1, 1), block: (32, 1, 1), shared_mem: 0 };
for batch_idx in 0..batch_size { // 2,048 iterations per norm call!
stream.launch_kernel(module, kernel_name, &config, &mut args)?;
}
}
- 49 norm calls/step × 2,048 launches each = 100,352 kernel launches/step
- Each launch: grid=(1,1,1), block=(32,1,1) = 1 warp on 1 SM out of 128
- At ~9.3μs per launch: 933ms of GPU time per step just in RMSNorm
- Meanwhile, all cuBLAS GEMMs total only ~22ms per step
Five Whys:
- Why is forward 240ms? GPU backpressure from 100K RMSNorm kernel launches
- Why 100K launches?
rms_norm_forwardloopsbatch_size=2048times - Why per-row loop?
RmsNormKernelprocesses one row (grid=(1,1,1)) - Why single-row kernel? Written before
BatchedVectorizedRmsNormKernel - Why not updated? Backward module already used batched variant; forward wasn’t
Fix (entrenar PR #238, merged):
#![allow(unused)]
fn main() {
// AFTER (single launch, all rows in parallel):
let kernel = BatchedVectorizedRmsNormKernel::new(hidden_size, batch_size);
let config = LaunchConfig {
grid: (1, batch_size, 1), // One block per row
block: (256, 1, 1), // 8 warps per block
shared_mem: 8 * 4,
};
stream.launch_kernel(module, "batched_rmsnorm_vectorized", &config, &mut args)?;
}
Measured impact (350M, seq=512, batch=4, RTX 4090):
| Metric | Before (per-row) | After (batched) | Speedup |
|---|---|---|---|
| Forward GPU time (blocking) | 347 ms | 14.0 ms | 24.8x |
| Forward CPU dispatch (async) | 241 ms | 2.66 ms | 91x |
| Total step GPU time | 356 ms | 15.1 ms | 23.6x |
| Step 1 (with warmup) | 1,357 ms | 339 ms | 4.0x |
| MFU (step 1) | 4.4% | 17.5% | 4.0x |
| 50-step training | 53.2s | 2.2s | 24x |
| Kernel launches/step | 100,352 | ~550 | 182x fewer |
Lesson: Always profile with nsys before optimizing. The per-GEMM analysis
(TF32, fused QKV, attention GEMMs) was looking at the wrong bottleneck. A
single for loop in a support kernel consumed 97% of GPU time.
6.9 TF32 Tensor Core Investigation (Phase 5a, MEASURED)
Discovery: cuBLAS gemm_f32() was using CUBLAS_COMPUTE_32F (strict FP32,
82.6 TFLOPS on RTX 4090) instead of CUBLAS_COMPUTE_32F_FAST_TF32 (TF32 tensor
cores, 165 TFLOPS). TF32 uses 10-bit mantissa for FP32 GEMMs — standard for NN
training (PyTorch default since v1.7).
Implementation (trueno-gpu 0.4.26, entrenar PR #236):
| Change | File | Before | After |
|---|---|---|---|
| Compute type | cublas.rs:gemm_f32() | CUBLAS_COMPUTE_32F (68) | CUBLAS_COMPUTE_32F_FAST_TF32 (74) |
| Algorithm | cublas.rs:gemm_f32() | CUBLAS_GEMM_DEFAULT (-1) | CUBLAS_GEMM_DEFAULT_TENSOR_OP (99) |
| Math mode | cublas.rs:CublasHandle::new() | CUBLAS_TENSOR_OP_MATH (1, deprecated) | CUBLAS_TF32_TENSOR_OP_MATH (3) |
Dogfood results (350M, seq=512, batch=4, RTX 4090, 50 steps):
| Metric | Pre-TF32 (§6.6) | Post-TF32 | Delta |
|---|---|---|---|
| Step time (p50) | 255.7 ms | 256.9 ms | +0.5% (noise) |
| Forward time | 240.0 ms | 241.2 ms | +0.5% (noise) |
| Tok/s (steady state) | ~8,020 | ~7,966 | -0.7% (noise) |
| Step time (p95) | N/A | 265.5 ms | — |
Result: No measurable improvement from TF32 at 350M model size.
Root cause analysis (Five Whys):
- Why no improvement? GEMM compute time is a small fraction of total step time.
- Why is GEMM compute small? At seq=512/batch=4, the largest GEMM is [2048,1024]×[1024,4096] = 17.2 GFLOPs. At TF32 peak (165 TFLOPS): 0.10ms. At FP32 peak (82.6 TFLOPS): 0.21ms. Saving: 0.11ms per GEMM.
- Why doesn’t 0.11ms × 168 GEMMs/fwd = 18ms saving matter? Because total step time is 257ms. GEMM compute is ~35ms (TF32) vs ~55ms (FP32). The 20ms saving is ~8% of step time.
- Why isn’t 8% saving visible? Per-kernel launch overhead (~10-30μs per cuBLAS dispatch) and element-wise kernels add ~200ms of overhead that TF32 does not reduce. The 20ms is within measurement noise of this overhead.
- Why so much overhead? The forward pass anomaly (§6.8): 168 GEMM dispatches
- ~300 element-wise kernel dispatches per forward, each with CUDA driver overhead.
Arithmetic intensity analysis (determines whether TF32 helps per-GEMM):
| GEMM | Shape | AI (FLOPs/byte) | TF32 crossover (164) | Bound |
|---|---|---|---|---|
| Q/O projection | [2048,1024]×[1024,1024] | 215 | Above | Compute → TF32 helps |
| K/V projection | [2048,1024]×[1024,256] | 95 | Below | Memory → TF32 no help |
| gate/up FFN | [2048,1024]×[1024,4096] | 307 | Above | Compute → TF32 helps |
| down FFN | [2048,4096]×[4096,1024] | 307 | Above | Compute → TF32 helps |
K/V GEMMs (GQA, N=256) are memory-bandwidth bound at TF32 rate — the tensor cores finish faster than data can be loaded. TF32 only helps the 5 larger GEMMs per block, not all 7.
Confirmation: The raw cuBLAS benchmarks (§6.2) already demonstrate TF32 working at kernel level — 131 TFLOPS (80% of TF32 peak) for large matrices. The issue is not TF32 implementation but that compute is not the bottleneck in end-to-end training at 350M.
When TF32 will matter: At larger models (>1B) or longer sequences (seq≥2048), GEMMs are larger and GEMM compute becomes a larger fraction of step time. The optimization is “banked” for future scaling.
MFU at steady state (corrected):
350M model (seq=512, batch=4, TF32 enabled):
FLOPs per step: 6 × 370M × 2048 = 4.55 TFLOP
Step time: 257 ms (p50, steady state)
Achieved FLOP/s: 4.55 / 0.257 = 17.7 TFLOP/s
MFU (vs TF32 peak): 17.7 / 165 = 10.7%
MFU (vs FP32 peak): 17.7 / 82.6 = 21.4%
Note: The runtime-reported MFU of 4.4% at step 1 is based on the 1357ms step-1 latency (includes JIT warmup). Steady-state MFU is 10.7% (vs TF32) / 21.4% (vs FP32). The §6.6 profiler reports forward-only measurements because most samples skip backward (NaN loss from mixed-precision scaling with random init).
6.10 Post-ALB-076 Kernel Profile (nsys, seq=1024)
With the RMSNorm bottleneck eliminated, nsys profiling reveals the actual performance landscape at production seq_len=1024:
nsys profile --stats=true --trace=cuda,cublas (50 steps, seq=1024, batch=4)
GPU Kernel Time Breakdown:
21.9% 725ms 9,800 cutlass GEMM 256x128 nn (FFN gate/up/down)
13.0% 431ms 4,800 batched_softmax ← MAJOR BOTTLENECK
12.2% 404ms 4,824 scale (attention scores) ← MAJOR BOTTLENECK
10.7% 356ms 4,800 cutlass GEMM 128x128 nn (QKV projections)
9.4% 313ms 4,824 cutlass GEMM 256x64 nn (output proj)
7.1% 236ms 9,600 cutlass GEMM 128x64 nn
5.7% 190ms 4,872 cutlass GEMM 64x64 nn
4.5% 149ms 4,920 batched_transpose ← attention overhead
3.3% 110ms 9,600 cutlass GEMM 64x64x32 nn
2.8% 92ms 200 fused_cross_entropy
2.6% 85ms 10,272 residual_add
2.2% 72ms 4,800 fused_swiglu
1.6% 53ms 9,800 batched_rmsnorm_vectorized ← was 97.1%!
CUDA API Time:
59.2% 2.86s 228 cuStreamSynchronize ← BIGGEST time sink
11.0% 530ms 637 cuMemcpyDtoH
9.2% 444ms 170,480 cuMemcpyDtoDAsync
5.7% 274ms 1,054 cuMemcpyHtoD
5.3% 256ms 103,469 cuLaunchKernel ← still 103K launches
Key observations:
-
GEMMs dominate GPU compute (~70%): As expected after eliminating the RMSNorm bottleneck. cuBLAS tensor core GEMMs are the core workload.
-
Attention non-GEMM overhead = 29.7%: softmax (13%) + scale (12.2%) + transpose (4.5%). Flash Attention would fuse all three into the GEMM.
-
Stream sync = 59% of CUDA API time: 228 syncs × 12.5ms avg = 2.86s. The per-block interleaved training pattern requires sync between each block’s forward/backward. CUDA Graphs would eliminate this.
-
103K kernel launches: Still high (2,069/step). Each costs ~2.5μs in
cuLaunchKerneloverhead. CUDA Graphs batch these. -
170K D2D copies: Memory layout conversions (interleaved↔batched). 102 GB total — optimizing data layout would eliminate most.
Next optimization targets (in priority order):
| Target | Current Impact | Expected Gain | Approach |
|---|---|---|---|
| Flash Attention | 29.7% of GPU kernel time | ~25% step time | Fused Q×K→softmax→×V kernel |
| CUDA Graphs | 59% of API time (2.86s) | ~40% step time | Graph capture for fwd/bwd |
| D2D copy reduction | 9.2% of API time | ~8% step time | Unified memory layout |
6.11 v3 Training Time Impact (Updated)
Post-ALB-076 at seq=1024, batch=4, grad_accum=1:
| Scenario | Step Time | Tok/s | Wall Clock (250K steps) |
|---|---|---|---|
| Baseline (PTX GEMMs) | 4,400 ms | 934 | 12.7 days |
| Phase 1-4 (cuBLAS) | 1,379 ms | 1,485 | 4.0 days |
| Phase 5b (+ batched RMSNorm) | 444 ms | 9,216 | 1.3 days |
| Phase 6 (+ CUDA Graphs) | ~200 ms | ~20K | ~14 hours |
| Phase 7 (+ Flash Attention) | ~130 ms | ~31K | ~9 hours |
Note: Phase 5b step time of 444ms includes JIT warmup. Steady-state estimated ~250-350ms based on profiler forward pass timing. With grad_accum=128 (production), effective training time is per micro-batch × accum_steps.
6.12 Tensor Core NaN in Backward GEMMs — ROOT CAUSE FOUND (ALB-076, FIXED)
Discovery: cuBLAS tensor core GEMM algorithms (CUBLAS_GEMM_DEFAULT_TENSOR_OP,
algorithm 99) produce ALL NaN output for transposed backward GEMMs when
input gradient magnitudes reach ~1e5. Forward GEMMs (NoTrans/NoTrans) are
unaffected. This was the root cause of complete NaN corruption in v3 training.
Symptom: ALL GPU-resident transformer block weights become NaN after the first optimizer step. Every gradient produced by cuBLAS backward is NaN.
Five Whys analysis:
- Why NaN weights? Optimizer reads NaN weight gradients from cuBLAS backward
- Why NaN gradients? cuBLAS
gemm_backward_a/gemm_backward_boutput ALL NaN starting at backward call #36 (first backward of block 18, FFN down_proj) - Why NaN output from valid finite inputs? Tensor core GEMM algorithm
(
CUBLAS_GEMM_DEFAULT_TENSOR_OP) has a numerical fault for transposed operands - Why only backward and not forward? Backward uses
Trans/NoTransandNoTrans/Transtranspose flags; forward usesNoTrans/NoTrans(unaffected) - Why only after ~5 blocks (call #36)? Gradient magnification through 24-layer backward reaches ~1e5 magnitude at block 18, triggering the fault
Diagnostic evidence (NaN scan on every cuBLAS backward call):
| Call # | Block | Direction | grad_out max | cuBLAS output | Status |
|---|---|---|---|---|---|
| 0 | 23 | bwd_a | small | max=3.24e-5 | Valid |
| 8 | 22 | bwd_a | ~1e-2 | max=1.04e-2 | Valid |
| 29 | 19 | bwd_b | ~1e2 | max=9.40e2 | Valid |
| 35 | 19 | bwd_b | ~1e-3 | max=1.49e-3 | Valid |
| 36 | 18 | bwd_a | 2.56e5 | ALL 4.2M NaN | BUG |
| 37+ | 18-0 | all | — | ALL NaN | Cascading |
Key observation: Call #36 inputs are entirely valid (grad_out: 0 NaN, max=2.56e5; weight_b: 0 NaN, max=1.98e-2). The tensor core algorithm converts valid finite inputs to NaN.
Falsified hypotheses (before root cause found):
- TF32 precision: Changing
CUBLAS_COMPUTE_32F_FAST_TF32→CUBLAS_COMPUTE_32Falone did NOT fix NaN — the algorithm, not precision, was the issue - Stream synchronization:
CUDA_LAUNCH_BLOCKING=1still produced NaN - Buffer size mismatch: Oversized buffers verified to be within-bounds access
Fix (trueno #170, entrenar #239):
| Change | File | Before | After |
|---|---|---|---|
| Math mode | cublas.rs:CublasHandle::new() | CUBLAS_TF32_TENSOR_OP_MATH (3) | CUBLAS_DEFAULT_MATH (0) |
| Compute type | cublas.rs:gemm_f32() | CUBLAS_COMPUTE_32F_FAST_TF32 (74) | CUBLAS_COMPUTE_32F (68) |
| Algorithm | cublas.rs:gemm_f32() | CUBLAS_GEMM_DEFAULT_TENSOR_OP (99) | CUBLAS_GEMM_DEFAULT (-1) |
Result (350M, seq=1024, batch=4, RTX 4090, 2 steps):
| Metric | With tensor cores | Without tensor cores | Delta |
|---|---|---|---|
| NaN in gradients | ALL (4.2M elements) | 0 | Fixed |
| Loss (step 1) | NaN | 10.4007 | Fixed |
| Tok/s | — | 5,216 | 5.9x over PTX |
| MFU (step 1) | — | 15.1% | vs FP32 peak |
| gnorm | NaN | 2.05 | Healthy |
Performance impact: cuBLAS SIMD (no tensor cores) is still 5.9x faster than hand-written PTX (5,216 vs 890 tok/s). The tensor core advantage (~2x theoretical) is irrelevant when it produces NaN.
Phase 5a status: REVERTED. TF32 tensor cores (§6.9) provided 0% measurable improvement at 350M AND cause NaN in backward. The optimization is removed entirely. Phase numbering unchanged; Phase 5a is now a null operation.
Lesson: Tensor core GEMM algorithms have undocumented numerical edge cases with large-magnitude transposed operands. The NVIDIA documentation does not warn about this failure mode. Always validate full backward pass (all layers, production gradient magnitudes) before enabling tensor cores in training.
6.13 v3 Training Results (LIVE, step 1000+)
Config: 350M model, seq=1024, batch=4, codeparrot-clean (5.29B tokens, 20 shards × ~260K sequences), max_steps=250K, save_interval=1000.
Loss curve (v3, measured):
| Step | Loss | Val Loss | Val PPL | Tok/s | MFU | gnorm | lr |
|---|---|---|---|---|---|---|---|
| 1 | 10.40 | — | — | 5,606 | 16.2% | 2.19 | 1.5e-7 |
| 100 | 8.26 | — | — | 7,648 | 22.1% | 5.08 | 1.5e-5 |
| 200 | 6.89 | — | — | 7,194 | 20.8% | 2.43 | 3.0e-5 |
| 700 | 6.78 | — | — | 7,608 | 22.0% | 2.49 | 1.1e-4 |
| 900 | 6.90 | — | — | 7,653 | 22.2% | 2.32 | 1.4e-4 |
| 1000 | 6.93 | 7.38 | 1607.6 | 7,676 | 22.2% | 3.04 | 1.5e-4 |
| 1800 | 6.71 | — | — | 6,977 | 20.2% | 3.12 | 2.7e-4 |
| 1900 | 6.50 | — | — | 6,974 | 20.2% | 2.01 | 2.9e-4 |
| 2000 | 6.36 | 7.19 | 1331.7 | 6,972 | 20.2% | 2.85 | 3.0e-4 |
| 2200 | 7.63 | — | — | 6,807 | 19.7% | 2.44 | 3.0e-4 |
| 2500 | 6.84 | — | — | 6,824 | 19.8% | 3.04 | 3.0e-4 |
| 3000 | 7.24 | 7.20 | 1341.2 | 6,783 | 19.6% | 2.17 | 3.0e-4 |
| 3500 | 6.54 | — | — | 6,681 | 19.3% | 2.62 | 3.0e-4 |
| 4000 | 7.85 | 7.10 | 1208.7 | 6,695 | 19.4% | 1.53 | 3.0e-4 |
| 4500 | 7.28 | — | — | 6,609 | 19.1% | 2.10 | 3.0e-4 |
| 5000 | 6.98 | 7.13 | 1244.0 | 6,632 | 19.2% | 1.83 | 3.0e-4 |
| 5500 | 6.49 | — | — | 6,565 | 19.0% | 1.65 | 3.0e-4 |
| 6000 | 7.16 | 7.05 | 1157.3 | 6,586 | 19.1% | 2.13 | 3.0e-4 |
| 7000 | 7.44 | 6.99 | 1084.9 | 6,586 | 19.1% | 1.19 | 3.0e-4 |
| 8000 | 7.14 | 7.02 | 1117.8 | 6,583 | 19.1% | 2.42 | 3.0e-4 |
| 9000 | 6.79 | 7.02 | 1114.0 | 6,561 | 19.0% | 0.89 | 3.0e-4 |
| 10000 | 6.35 | 7.07 | 1180.1 | 6,564 | 19.0% | 1.02 | 3.0e-4 |
| 12000 | 6.66 | 6.94 | 1036.7 | 6,570 | 19.0% | 0.84 | 3.0e-4 |
| 14000 | 6.48 | 6.93 | 1026.8 | 6,567 | 19.0% | 0.78 | 3.0e-4 |
| 16000 | 6.88 | 6.94 | 1036.4 | 6,578 | 19.0% | 0.37 | 3.0e-4 |
| 18000 | 6.56 | 6.96 | 1051.0 | 6,595 | 19.1% | 0.44 | 3.0e-4 |
| 20000 | 7.15 | 6.93 | 1023.1 | 6,621 | 19.2% | 0.36 | 3.0e-4 |
| 22000 | 6.77 | 6.92 | 1012.7 | 6,632 | 19.2% | 0.32 | 3.0e-4 |
| 24000 | 6.83 | 6.92 | 1010.5 | 6,651 | 19.3% | 0.22 | 3.0e-4 |
| 26000 | 6.61 | 6.91 | 1000.3 | 6,682 | 19.3% | 0.15 | 3.0e-4 |
Steady-state performance (steps 100-2000 warmup average):
- 7,600 tok/s ± 200 (during warmup, steps 100-1000)
- 22.1% MFU vs FP32 peak (RTX 4090, 82.6 TFLOP/s)
- 516 ms/step (p50, warmup phase)
Post-warmup performance (steps 2000-26000, constant lr):
- 6,630 tok/s ± 80 (steady state)
- 19.2% MFU (post-warmup average)
- ~560 ms/step (p50)
- VRAM: 11.4 GB / 24 GB (47% utilization)
- 0 NaN in 26,400 steps (ALB-077 fix verified)
Checkpoints (every 1000 steps, 1520 MB SafeTensors each):
- step-1000 through step-26000 — all verified OK (26 checkpoints total).
Training dynamics:
- Loss converges from 10.4 to ~6.9 in 1000 steps (during warmup)
- Post-warmup spike at step 2200 (loss=7.63) — lr reached max (3e-4), recovered by step 2500
- Val loss improving: 7.38 → 7.05 → 6.94 → 6.93 → 6.92 → 6.91 (plateau since step 12K)
- Val PPL: 1608 → 1157 → 1037 → 1027 → 1013 → 1000 (slow convergence, nearing floor)
- Gradient norm collapse: 3.04 (step 1K) → 1.02 (10K) → 0.15 (26K) — 20x decrease
- Expected for well-initialized transformers as loss landscape flattens
- ZClip spikes infrequent post-15K (z≤3.4, ema=0.14)
- B_noise decreasing: 0.22 → 0.08 (gradient signal/noise ratio improving)
Token efficiency: 108M tokens seen at step 26K. Val PPL=1000 at 108M tokens. Reference: codeparrot-small (110M) achieved val_loss ~3.5 after 50B tokens. The 350M model is undertrained — 108M tokens is <1% of typical training budget.
ETA: 250K steps × 0.56s = 38.9 hours (~1.6 days from start). At step 26K: ~10.4% complete, ~34.5 hours remaining. Compare: PTX baseline would be 250K × 4.4s = 12.7 days.
6.14 Stream Sync Bottleneck Analysis (ALB-078, Five Whys)
Observation: v3 training at step 1500 shows step time increased to 618ms (from 516ms at step 1000). The difference correlates with gradient clipping becoming active as gnorm grows.
Five Whys:
- Why 618ms/step? Per-block gradient clipping introduces stream syncs
- Why per-block syncs?
compute_workspace_clip_scale_gpucallsstream.synchronize()after launching 9squared_sumkernels per block - Why sync needed? CPU must download 9 partial-sum buffers to compute
clip_scale = min(1, max_norm / sqrt(sum_of_squared_norms)) - Why CPU-side? No fused GPU kernel exists for norm reduction + clip
- Why 24 syncs? One per transformer block (interleaved backward+optimizer)
Sync budget (per step, with grad_clip: 1.0):
| Sync Point | Count/step | Location | Necessary? |
|---|---|---|---|
| Per-block clip norm | 24 | compute_workspace_clip_scale_gpu | REDUNDANT |
| LM head norm | 1 | squared_sum_cuda | REDUNDANT |
| Final global norm | 1 | compute_clip_scale_with_norm | REDUNDANT |
| CE loss D2H | 1 | fused_cross_entropy_cuda | YES (NaN guard) |
| Pre-embed sync | 1 | gpu_backward:1134 | YES (C-STREAMSYNC-001) |
| Total | 28 | 2 necessary, 26 redundant |
Fix (entrenar #240, trueno #171) — IMPLEMENTED:
Two new PTX kernels in trueno-gpu/src/kernels/optimizer/fused_clip.rs:
-
ClipScaleReduceKernel: Single-CTA, single-thread. Reads contiguousf32[total_partials]buffer of squared-sum partial results, computesclip_scale = min(1.0, max_norm / sqrt(sum)). IEEE 754 handles zero-norm without branching (div(x, 0.0) = +inf,min(+inf, 1.0) = 1.0). Writesoutput[0] = scale, output[1] = normfor observability. -
GradientClipGpuScaleKernel: Element-wise. Reads scale from GPU pointer (not host param). Early exit whenscale ≈ 1.0(within 1e-7) to avoid unnecessary memory bandwidth when no clipping needed.
Integration in entrenar/src/autograd/cuda_optim.rs:
FusedClipState: Pre-allocated contiguous partials buffer + scale buffersquared_sum_launch_into: Writes partial sums at offset into contiguous bufferclip_scale_reduce_cuda: Launches ClipScaleReduceKernel (grid 1×1, block 1×1)gradient_clip_gpu_scale_cuda: Launches GradientClipGpuScaleKernel
Pipeline (per block): 9× squared_sum_launch_into → 1× clip_scale_reduce → 9× gradient_clip_gpu_scale. Zero sync points, zero D2H transfers.
This eliminates 26 of 28 syncs/step. The 2 remaining are irreducible:
- CE loss download for NaN guard
- Final sync before embed gradient D2H (C-STREAMSYNC-001)
Status: Implemented, compiles, awaiting dogfood on next training restart. Expected impact: step time 618ms → ~500ms (~20% improvement).
6.15 Training Quality Analysis (ALB-079/080, Five Whys)
Observation: v3 training at step 26K shows val_loss plateau at 6.92 (val_ppl=1000) since step 12K. Gradient norm collapsed from 3.04 (step 1K) to 0.15 (step 26K) — 20x decrease while lr is at peak (3e-4).
Five Whys — Root Cause 1: Missing Cosine LR Decay (ALB-079)
- Why constant lr=3e-4 at all steps?
CudaTransformerTrainer::current_lr()only implemented linear warmup; returnedbase_lrafter warmup (line 1938) - Why no cosine?
TransformerTrainConfighas nolr_schedulerfield; YAML config parsed by bridge but not propagated to CUDA path - Why not caught earlier? At step 2K-5K, cosine barely differs from constant (lr ≈ 2.99e-4 vs 3.00e-4); plateau only visible after 10K steps
- Fix (entrenar #241): Cosine decay in
current_lr()usingwarmup_stepsandmax_steps. CPU embedding optimizer synced viaset_lr().
Five Whys — Root Cause 2: Effective Batch Size 48-128x Too Small (ALB-080)
- Why val_ppl plateau at 1000? Gradient noise too high to escape loss basin
- Why noisy gradients? Effective batch = 4 × 1 × 1024 = 4,096 tokens/step
- Why 4,096?
gradient_accumulation: 1in config, VRAM limitsbatch_size: 4 - Why so small? Config was set for debugging; no Chinchilla batch size analysis
- Why does it matter? Comparable 350M models use 131K-524K tokens/step (32-128x larger)
| Model | Batch Size (tokens/step) |
|---|---|
| CodeGen-350M-mono | ~500K+ |
| CodeParrot-small (110M) | 196K |
| GPT-2 124M (nanoGPT) | ~524K |
| Albor v3 | 4,096 |
| Albor v4 (planned) | 131,072 |
Fix: pretrain-350m-v4.yaml with gradient_accumulation: 32 (131K tokens/step),
warmup_steps: 375, max_steps: 7500 (~1B tokens). Same wall-clock as v3 (same
number of forward/backward passes), dramatically better gradient quality.
Expected impact: val_ppl should break through 1000 floor and reach <100 by 1B tokens. gnorm should stabilize at 0.5-2.0 (not collapse to 0.13).
7. Verification Architecture
7.1 Four-Layer Verification
Layer 1: CONTRACTS (provable-contracts / pv)
What: Algebraic invariants, proof obligations, falsification tests
When: BEFORE implementation (write contract first)
How: pv validate, pv scaffold, pv audit
Files: contracts/cublas-gemm-v1.yaml
contracts/training-step-budget-v1.yaml
Layer 2: BENCHMARKS (raw C ceiling + Criterion + regression detection)
What: Three-tier GEMM comparison with hardware ceiling
When: BEFORE (ceiling), DURING (Criterion), AFTER (regression)
How: make bench-gemm-compare, make bench-gemm-regression
Pattern: Raw C cuBLAS (ceiling) vs Rust cuBLAS (target) vs PTX (floor)
- FFI overhead < 2% (Rust vs Raw C)
- Speedup > 10x (cuBLAS vs PTX)
- Regression < 10% per shape between commits
- Follows trueno/benchmarks/ matmul_comparison.py pattern exactly
Layer 3: BRICK PROFILING (probador)
What: Per-component time budgets with Jidoka gates
When: DURING implementation (continuous enforcement)
How: BrickHouse builder, brick assertions, budget_ms
Pattern: Each training loop component = one Brick with:
- can_render() = Jidoka gate (fail if > 2x budget)
- verify() = timing assertion
- budget_ms = SLA from contract
Layer 4: LAYER TRACING (renacer BrickTracer)
What: Per-kernel, per-block, per-transfer timing with OTLP export
When: DURING profiling runs + AFTER implementation (regression detection)
How: BrickTracer.trace(), OTLP -> Jaeger, anomaly escalation
Pattern: Each CUDA kernel call = one trace span
- Forward: block_N_gemm_qkv, block_N_attention, block_N_ffn
- Backward: block_N_backward_gemm, block_N_backward_elementwise
- Transfer: pcie_h2d_embed, pcie_d2h_logits, pcie_h2d_grad
- Optimizer: block_N_optimizer_d2h, block_N_adamw, block_N_optimizer_h2d
7.2 Escalation Chain
Renacer implements automatic escalation from lightweight metrics to detailed tracing:
Steady state (metrics only):
- Counter: gemm_calls_total, pcie_bytes_total
- Gauge: step_time_ms, mfu_ratio
- Histogram: per_block_forward_us, per_block_backward_us
Escalation trigger (CV > 15% or efficiency < 25%):
- BrickTracer captures full syscall breakdown
- OTLP spans exported to Jaeger with per-kernel detail
- Anomaly detector flags the brick and step number
Alert (budget violation > 2x):
- Jidoka gate fires (probador)
- Training loop pauses (Andon alert)
- Full trace exported for post-mortem
This means training runs at full speed in steady state (metrics are SIMD- accelerated via trueno), and only pays the tracing cost when something goes wrong.
7.3 Continuous Verification During Training
# Run training with BrickTracer instrumentation
RUST_LOG=info renacer --otlp-endpoint http://localhost:4317 \
--otlp-service-name "albor-v3-cublas" \
--trace-compute \
--trace-compute-threshold 100 \
-- apr train apply --task pretrain \
--config configs/train/pretrain-350m-v3.yaml
# In another terminal: monitor brick budgets
apr monitor ./checkpoints/albor-base-350m-v3/
# Post-run: audit contract compliance
pv audit contracts/cublas-gemm-v1.yaml \
--binding contracts/trueno-gpu/cublas-binding.yaml
pv audit contracts/training-step-budget-v1.yaml \
--binding contracts/entrenar/step-budget-binding.yaml
# Post-run: view traces in Jaeger
# http://localhost:16686 -> Service: "albor-v3-cublas"
# Filter by: operation="gemm_forward", minDuration=10ms
8. Risks
| Risk | Mitigation | Contract Obligation |
|---|---|---|
| cuBLAS FP16 numerical divergence | Keep FP32 master weights, compare loss curves | FALSIFY-CUBLAS-002 |
| libcublas.so version mismatch | Pin to CUDA 12.x, test on lambda machine | FALSIFY-CUBLAS-003 |
| cuBLAS workspace memory pressure | Pre-allocate fixed workspace, share across GEMMs | training-memory-kernel-v1 |
| CPU optimizer becomes new bottleneck | Phase 4 contract (gpu-optimizer-v1) | FALSIFY-BUDGET-002 |
| Tensor core shapes require padding | Albor shapes (1024, 4096, 32768) already multiples of 8 | FALSIFY-CUBLAS-003 |
| FP16 weight precision loss | Standard practice; master weights remain FP32 on CPU | FALSIFY-CUBLAS-002 |
| Silent regression after optimization | Brick budgets + Jidoka gates detect immediately | FALSIFY-BUDGET-003 |
| Unaccounted overhead hiding bottleneck | Brick coverage >= 95% of step time enforced | FALSIFY-BUDGET-001 |
9. Dependencies
libcublas.sofrom CUDA toolkit (already installed:/usr/local/cuda/lib64/)nvccfor compiling raw C cuBLAS benchmark (ceiling measurement)- trueno-gpu crate (target for FFI integration)
- entrenar CudaTransformerTrainer (consumer of cuBLAS GEMMs)
- renacer BrickTracer (layer tracing instrumentation)
- probador brick budgets (SLA enforcement)
- provable-contracts /
pv(contract validation and audit) - Criterion.rs (Rust benchmark harness, already a trueno dev-dependency)
- No new Rust crate dependencies (pure FFI, no bindgen)
10. Contract Registry
| Contract File | Status | Validates |
|---|---|---|
contracts/cublas-gemm-v1.yaml | NEW (write before Phase 1) | cuBLAS correctness, buffer safety, MFU improvement |
contracts/training-step-budget-v1.yaml | NEW (write before Phase 0) | Brick-level performance SLAs, Jidoka enforcement |
contracts/training-gpu-kernel-v1.yaml | EXISTING | Parent contract — PCIe transfers, stability, gradient flow |
contracts/training-memory-kernel-v1.yaml | EXISTING | VRAM budget (must update for FP16 weight storage) |
contracts/training-config-kernel-v1.yaml | EXISTING | Epoch/step/LR algebraic consistency |
contracts/fused-kernels-v1.yaml | NEW (write before Phase 4) | Fused CE, RMS norm reuse, SwiGLU in-place, fused attention |
contracts/gpu-optimizer-v1.yaml | FUTURE (Phase 4) | GPU-resident AdamW correctness |
contracts/gpu-embedding-v1.yaml | FUTURE (Phase 5) | GPU embedding lookup + scatter-add |
contracts/async-pipeline-v1.yaml | FUTURE (Phase 6) | Compute/transfer overlap safety |
contracts/grad-checkpoint-v1.yaml | FUTURE (Phase 7) | Gradient checkpointing memory/correctness |
11. Unsloth-Inspired Kernel Optimizations
Source: Analysis of unslothai/unsloth (cloned 2026-03-05). Unsloth achieves 2x training speedup over HuggingFace via fused Triton kernels, selective activation saving, and in-place backward ops. These patterns translate to our Rust + CUDA PTX stack.
11.1 Fused Cross-Entropy Loss + Backward
What unsloth does: Single Triton kernel computes logsumexp, loss, and
dL/dx (softmax - one_hot) in one pass. Never materializes the full probability
distribution.
Current albor: Separate kernels for logits→softmax, softmax→loss, loss→grad.
For vocab=32K, batch=4, seq=1024, the logit tensor is [4096, 32768] = 512 MB
in FP32. Three kernel launches + three full reads/writes of this tensor.
Proposed change: Fused CE kernel that:
- Computes
logsumexpper row (FP32 accumulation for stability) - Computes
loss = logsumexp - logit[label]per row - Computes
grad[i] = exp(logit[i] - logsumexp) - delta(i, label)in-place - Never allocates full softmax tensor
Expected gain: -2 kernel launches, -1 GB memory bandwidth per step. Step time: ~20-40ms savings (CE is ~1% of step time, but memory bandwidth relief helps other kernels via improved cache pressure).
Contract: contracts/fused-kernels-v1.yaml — FALSIFY-FUSED-001
Equations:
fused_ce_correctness:
loss_fused = -logit[label] + log(sum(exp(logit[i]))) for each row
grad_fused[i] = exp(logit[i] - logsumexp) - delta(i, label)
Invariant: max_abs_diff(loss_fused, loss_separate) < 1e-5
Invariant: max_abs_diff(grad_fused, grad_separate) < 1e-5
Invariant: FP32 accumulation for logsumexp (no FP16 overflow on 32K vocab)
11.2 Activation Memory Reuse (RMS LayerNorm)
What unsloth does: RMS LayerNorm forward saves ONLY inv_var (1 scalar per
row = batch * seq_len floats). Backward recomputes normed = X * inv_var from
the activation cache. Total saved: O(B*S) instead of O(B*S*H).
Current albor: Saves X, W, inv_var, and normed per layer during
forward for use in backward. For 24 layers × [4096, 1024]:
X: 24 × 16 MB = 384 MBnormed: 24 × 16 MB = 384 MBinv_var: 24 × 16 KB = 384 KB (negligible)- Total saved: 768 MB of activation memory
Proposed change: Save only inv_var per layer. During RMS norm backward:
- Recompute
normed = X_cached * inv_var(X is available from the previous layer’s output or the activation cache) - Compute
d_weight = sum(grad_output * normed) - Compute
d_input = (grad_output * W - normed * d_weight_sum) * inv_var
Expected gain: -384 MB activation memory (normed tensor eliminated). This is 3.2% of 24 GB VRAM — modest alone, but compounds with other savings to potentially enable batch=8 without gradient checkpointing.
Contract: contracts/fused-kernels-v1.yaml — FALSIFY-FUSED-002
Equations:
rmsnorm_recompute_correctness:
normed_recomputed = X * inv_var_saved
max_abs_diff(normed_recomputed, normed_original) == 0.0 (exact, same FP32)
Memory reduction:
activation_memory(optimized) = activation_memory(current) - 24 * B * S * H * 4 bytes
For B=4, S=1024, H=1024: savings = 24 * 4 * 1024 * 1024 * 4 = 402,653,184 bytes (~384 MB)
11.3 SwiGLU In-Place Backward
What unsloth does: GEGLU/SwiGLU backward overwrites input buffers with
gradient results. Forward: h = silu(e) * g. Backward stores dh, de, dg
into the same memory as h, e, g. No new allocations.
Current albor: CudaGradWorkspace reuses buffers per-block (already good),
but within a block, SwiGLU backward allocates separate grad_gate, grad_up,
and grad_down buffers. For intermediate_size=4096:
grad_gate:[4096, 4096]= 64 MBgrad_up:[4096, 4096]= 64 MB- Total per-block overhead: 128 MB (shared workspace, so only peak matters)
Proposed change: Fuse SwiGLU backward to overwrite gate/up buffers in-place:
d_gate = grad_output * up * silu_deriv(gate)→ store ingatebufferd_up = grad_output * silu(gate)→ store inupbuffer- No separate allocation for d_gate, d_up
Expected gain: -128 MB peak workspace per block (already shared, so reduces peak VRAM, not total allocations). Main benefit is reduced memory bandwidth — fewer buffer copies between kernels.
Contract: contracts/fused-kernels-v1.yaml — FALSIFY-FUSED-003
Equations:
swiglu_inplace_correctness:
d_gate_inplace = grad_out * up * sigmoid(gate) * (1 + gate * (1 - sigmoid(gate)))
d_up_inplace = grad_out * silu(gate)
max_abs_diff(d_gate_inplace, d_gate_separate) < 1e-5
max_abs_diff(d_up_inplace, d_up_separate) < 1e-5
11.4 Mixed Precision Discipline (Validated)
What unsloth does: Loads activations as FP32 for critical arithmetic (variance, softmax, logsumexp), keeps weights in BF16, casts output back after critical ops.
Albor status: Already implemented correctly (validated by ALB-072 fix). Our backward is all FP32, master weights are FP32 on CPU, forward weights are FP32 on GPU (will become FP16 with cuBLAS). This matches unsloth’s pattern.
Action: No code change needed. Document as validation that our approach matches production-grade mixed precision practice.
11.5 RoPE Head Grouping
What unsloth does: Applies RoPE to 4 heads simultaneously, loading sin/cos
once and reusing across the group. ROPE_GROUP_SIZE = 4.
Current albor: Per-head RoPE application in the attention forward kernel. Sin/cos recomputed or reloaded per head.
Proposed change: Batch RoPE across all Q heads (16) and KV heads (4) with single sin/cos load. For our GQA architecture (16 Q heads, 4 KV heads):
- Q: load sin/cos once, apply to 16 heads
- K: same sin/cos, apply to 4 heads
- V: no RoPE (not rotated)
Expected gain: ~10% attention kernel speedup from better L2 cache utilization. Small absolute impact (~5-10ms/step) since RoPE is not compute-dominant.
Contract: contracts/fused-kernels-v1.yaml — FALSIFY-FUSED-004
Equations:
rope_grouped_correctness:
For each head h in [0, n_heads):
Q_rotated_grouped[h] == Q_rotated_individual[h] (bit-exact)
Performance: T_rope(grouped) < 0.9 * T_rope(individual)
11.6 Fused Attention (QK^T → Softmax → V)
What unsloth does: Uses Flash Attention or Flex Attention to fuse the
3-step attention computation into a single kernel. Never materializes the
full [seq, seq] attention score matrix.
Current albor: Three separate operations per attention head:
scores = Q @ K^T→ cuBLAS GEMM →[4096, 1024](with cuBLAS)probs = softmax(scores / sqrt(d_k))→ elementwise kerneloutput = probs @ V→ cuBLAS GEMM
This materializes the [batch, heads, seq, seq] = [4, 16, 1024, 1024] = 256 MB
attention score tensor. For 24 layers, that’s 6.1 GB if all layers’ scores
are live simultaneously (they aren’t in our per-block architecture, but the
per-block peak still includes this).
Proposed change: Custom fused attention kernel (not Flash Attention — our seq=1024 is short enough that tiled online softmax gives most of the benefit):
- Tile Q, K, V into blocks (e.g., 64×64)
- Compute
QK^Ttile, apply causal mask, running softmax (online algorithm) - Accumulate
softmax(tile) @ Vwithout materializing full score matrix - Output: attention result directly, save only logsumexp for backward
Expected gain:
- -256 MB peak VRAM per block (attention scores not materialized)
- -2 kernel launches per layer (3→1)
- ~15% attention speedup from reduced memory bandwidth
- Enables batch=8 by freeing VRAM headroom
Contract: contracts/fused-kernels-v1.yaml — FALSIFY-FUSED-005
Equations:
fused_attention_correctness:
output_fused = softmax(Q @ K^T / sqrt(d_k) + causal_mask) @ V
max_abs_diff(output_fused, output_separate) < 1e-3 (FP32)
max_abs_diff(output_fused, output_separate) < 1e-2 (FP16)
Memory:
peak_attn_memory(fused) < peak_attn_memory(separate) / 4
# Separate: [B, H, S, S] = 256 MB
# Fused: [B, H, tile, tile] = 256 MB / (S/tile)^2
11.7 Chunked Cross-Entropy for Future Vocab Scaling
What unsloth does: For vocab > 65K, splits logsumexp computation into chunks
of 65536. Mathematical property: logsumexp(chunked_logsumexp) == logsumexp(full).
Current albor: Vocab = 32K, fits in single chunk. Not needed now.
Future applicability: If we scale to multi-lingual (65K+ vocab) or adopt a larger tokenizer, chunked CE prevents register pressure overflow in the fused CE kernel. The logsumexp decomposition is:
logsumexp([a, b]) = max(a, b) + log(exp(a - max) + exp(b - max))
Each chunk computes a partial logsumexp. The final logsumexp combines partials. This is numerically stable and mathematically exact.
Contract: Deferred until vocab > 65K. Will be added to fused-kernels-v1.yaml
if tokenizer v3 exceeds 65K vocabulary.
11.8 Gradient Checkpointing (Activation Recomputation)
What unsloth does: Trades compute for memory by recomputing layer activations during backward instead of saving them during forward. 2x slower backward, but ~3x smaller activation memory.
Current albor: Per-block interleaved backward+optimizer design already limits peak activation memory to one block’s worth. But with fused attention (§11.6) and activation reuse (§11.2), we may not need gradient checkpointing for batch=4.
When needed: If batch=8 + seq=2048 still OOMs after §11.2 + §11.6.
Contract: contracts/grad-checkpoint-v1.yaml (FUTURE — already in registry)
Equations:
checkpoint_correctness:
grad(checkpointed) == grad(full_save) # Bit-exact: same computation
Memory:
peak_activation(checkpointed) = peak_activation(full) / num_checkpoint_segments
Performance:
T_backward(checkpointed) < 2.0 * T_backward(full) # At most 2x slower
11.9 Summary: Optimization Priority Matrix
| # | Optimization | Expected Gain | Memory Savings | Effort | Phase |
|---|---|---|---|---|---|
| 1 | cuBLAS tensor core GEMMs | 50x GEMM, 2x step | 0 | High | 1-3 |
| 2 | Fused CE loss + backward | 20-40ms/step | -512 MB bandwidth | Medium | 4 |
| 3 | RMS norm activation reuse | 0 (compute) | -384 MB | Low | 4 |
| 4 | SwiGLU in-place backward | 10-20ms/step | -128 MB peak | Low | 4 |
| 5 | RoPE head grouping | 5-10ms/step | 0 | Low | 4 |
| 6 | Fused attention (tiled) | 15% attn speedup | -256 MB/layer | High | 5 |
| 7 | Chunked CE (vocab >65K) | 0 (future) | 0 | Low | Deferred |
| 8 | Gradient checkpointing | -2x backward | -66% activations | Medium | 7 |
Cumulative impact (Phases 1-5b, measured):
- Step time: 4,400ms → 444ms (9.9x; cuBLAS SIMD 5.9x, batched RMSNorm 24.8x fwd)
- MFU: 2.5% → 26.7% (vs FP32 peak, runtime-reported)
- Tok/s: 934 → 9,216 (9.9x improvement)
- Note: Tensor cores disabled (ALB-076, §6.12) — produce NaN in transposed backward GEMMs
11.10 Falsification Tests for Kernel Optimizations
| ID | Rule | Prediction | Contract |
|---|---|---|---|
| FALSIFY-FUSED-001 | Fused CE matches separate CE | max_abs_diff(loss) < 1e-5 on 50M model, 50 steps | fused-kernels-v1 |
| FALSIFY-FUSED-002 | RMS norm recompute is bit-exact | normed_recomputed == normed_original (FP32, exact) | fused-kernels-v1 |
| FALSIFY-FUSED-003 | SwiGLU in-place backward correct | max_abs_diff(d_gate, d_gate_ref) < 1e-5 | fused-kernels-v1 |
| FALSIFY-FUSED-004 | RoPE grouped matches individual | Bit-exact Q_rotated for all 16 heads | fused-kernels-v1 |
| FALSIFY-FUSED-005 | Fused attention matches separate | max_abs_diff(output) < 1e-3 (FP32) | fused-kernels-v1 |
| FALSIFY-FUSED-006 | Memory savings measured | Activation peak reduced by >= 300 MB | fused-kernels-v1 |
| FALSIFY-FUSED-007 | Fused CE never materializes softmax | Peak memory during CE < B*S*V*4 bytes | fused-kernels-v1 |
| FALSIFY-FUSED-008 | Gradient checkpointing bit-exact | grad(checkpointed) == grad(full) for all params | grad-checkpoint-v1 |
| FALSIFY-FUSED-009 | Fused attention backward correct | All params get gradients, loss within 1% of separate | fused-kernels-v1 |
| FALSIFY-FUSED-010 | No training instability from fusions | 100-step run: loss.is_finite() every step, gnorm < 100 | fused-kernels-v1 |
Appendix A: Popperian Falsification of This Specification
Date: 2026-03-05
Method: batuta falsify . (108-item checklist) + manual chain-of-thought
analysis of every claim, equation, and assumption in this spec.
Batuta project score: 80.1% (Andon Warning), 65 PASS, 0 FAIL, 43 PARTIAL. Key findings from batuta mapped to spec weaknesses below.
A.1 Chain-of-Thought Falsification
Each numbered item is a falsifiable claim from the spec, followed by the attempt to break it.
Claim 1: “Step time is 4,400ms with 57% in GEMM” (Section 2.4)
- Status: UNVERIFIED ESTIMATE. The breakdown is labeled “Estimated” but no profiling data backs it. The spec prescribes renacer BrickTracer profiling in Phase 0, but Phase 0 hasn’t run yet. The 57% GEMM figure is a guess.
- Risk: If GEMM is actually 30% of step time (e.g., CPU optimizer is 40%), cuBLAS integration yields only 1.3x speedup instead of 2x.
- Action: Phase 0 is blocking. Do not proceed to Phase 1 until BrickTracer confirms the breakdown. Add a contract obligation: FALSIFY-BASELINE-001.
Claim 2: “cuBLAS achieves 130-150 TFLOP/s on Albor shapes” (Section 4.1)
- Status: VERIFIED. Measured 152.3 TFLOP/s on FFN gate/up shape
[4096, 1024] x [1024, 4096], 141.2 TFLOP/s on FFN down, 89.4 TFLOP/s on square[1024, 1024]. The range 89-152 TFLOP/s matches or exceeds the 130-150 prediction for large shapes. Smaller square shapes are memory-bandwidth bound as expected. - Verification: trueno-gpu cuBLAS hardware tests (PR #165).
Claim 3: “FFI overhead < 2%” (Section 5.7, FALSIFY-CUBLAS-008)
- Status: PLAUSIBLE but untested. cuBLAS FFI is a single function call with no data copies (pointers passed through). 2% overhead is reasonable.
- Risk: If
CublasHandle::set_stream()is called per-GEMM (555 calls/step) rather than once per step, the cumulative overhead could exceed 2%. - Action: The wrapper should call
set_stream()once at step start, not per-GEMM. Add this as a contract invariant.
Claim 4: “MFU = 2.5% vs FP32 peak” (Section 1.2)
- Status: PARTIALLY FALSIFIED. The MFU formula uses
6 * P * tokens_per_stepbut this approximation assumes all FLOPs are in GEMMs. For a 370M model with batch=4, seq=1024, the attention score computation (QK^T) adds2 * S^2 * H * L = 2 * 1024^2 * 1024 * 24 = 51.5 GFLOPper step, which is <1% of the 9.1 TFLOP total. The 6x approximation is valid here. - Correction: MFU is correct to within ~1% of the true value. No action needed.
Claim 5: “Step time drops to 2,150ms after cuBLAS” (Section 6.1)
- Status: MEASURED — 1,379 ms (better than projected). The original
projection of 2,150ms assumed non-GEMM time stays constant at 1,900ms.
Actual measurement showed 1,379ms (seq=512, batch=4), which is 36% better
than projected. Verified via dogfooding:
apr train applywith cuBLAS (entrenar PR #233), 1,485 tok/s, 4.3% MFU. - FALSIFY-CUBLAS-009 still relevant: verify non-GEMM time decomposition.
Claim 6: “555 GEMM operations per step” (Section 2.1)
- Status: APPROXIMATELY CORRECT but undercounted. The count includes
attention score GEMMs (QK^T) but omits attention value application (V
projection after softmax), which is also a GEMM:
softmax(QK^T) * V. Forward: 24 blocks x 1 = 24. Backward: 24 blocks x 2 = 48. Plus attention backward for the score GEMM itself. - Correction: The actual count may be ~600 GEMMs, not 555. The difference is small (<10%) and doesn’t change the analysis materially, but the spec should note the approximation.
Claim 7: “Phase 7 achieves 17.5% MFU with batch=8” (Section 6.3)
- Status: CONTRADICTS KNOWN CONSTRAINT. Section 4.3 of the spec notes seq=1024, batch=8 currently OOMs. Phase 7 lists this as requiring gradient checkpointing, but with cuBLAS adding FP16 weight copies alongside FP32 master weights, VRAM pressure increases. The 650ms step time assumes batch=8 fits, which is unproven.
- Risk: batch=8 may still OOM even with gradient checkpointing if FP16+FP32 dual weight storage consumes the headroom.
- Action: Add VRAM budget equation to training-memory-kernel-v1.yaml for mixed-precision dual storage. FALSIFY-MEM-004: “batch=8 fits in 24GB with FP16 forward weights + FP32 master weights + gradient checkpointing.”
Claim 8: “Benchmark shapes are representative” (Section 5.2)
- Status: INCOMPLETE. The 6 benchmark shapes cover the large GEMMs but
omit the GQA key-value projection shapes:
[4096, 256, 1024](K and V projections with num_kv_heads=4, head_dim=64, so kv_dim=256). These are small, thin matrices where cuBLAS may show less speedup due to low arithmetic intensity. - Action: Add
(4096, 256, 1024, "attn_kv")to SHAPES in both C and Criterion benchmarks. This is the worst-case shape for tensor cores.
Claim 9: “Performance regression gate at 10%” (Section 5.5)
- Status: MATCHES batuta JA-04 finding. Batuta flagged JA-04 (Performance
Regression Gate) as PARTIAL with rejection “Benchmarks exist but not gated
in CI.” The spec defines
make bench-gemm-regressionbut does not integrate it into CI. - Action: Add
bench-gemm-regressionto theclean-room / gateCI workflow for trueno-gpu. This addresses JA-04.
Claim 10: “No new Rust crate dependencies” (Section 9)
- Status: CORRECT. Pure FFI bindings require only
libctypes (already in std) andlibcublas.so(system library). Nocublas-sysorbindgencrate needed. - Verified: This is consistent with trueno’s existing pattern of hand-written CUDA driver API bindings.
A.2 Batuta Findings Mapped to Spec
| Batuta ID | Status | Spec Impact |
|---|---|---|
| JA-04 | PARTIAL: “Benchmarks not gated in CI” | Section 5: Add bench-gemm-regression to CI |
| PW-02 | PARTIAL: “No SIMD optimization” | N/A (spec is about GPU, not CPU SIMD) |
| EDD-01 | PARTIAL: “Partial equation documentation” | Section 3.1: Ensure all contract equations have domain/codomain/invariants |
| EDD-03 | PARTIAL: “Numerical code without analytical validation” | Section 5.2: Raw C baseline IS the analytical validation |
| NR-01 | PARTIAL: “No explicit IEEE 754 testing” | Add: cuBLAS FP32 accumulation contract (C-CUBLAS-004) covers this |
| NR-02 | PARTIAL: “Single platform testing” | N/A (CUDA-only by design, RTX 4090 target) |
| AI-01 | PARTIAL: “Config examples incomplete” | Add cuBLAS config example to YAML configs |
| AI-05 | PARTIAL: “No explicit validator” | apr train validate already validates; extend for cuBLAS feature |
A.3 Missing Falsification Tests (Discovered by Chain-of-Thought)
The following tests are NOT in the current contract but SHOULD be:
# Add to cublas-gemm-v1.yaml
- id: FALSIFY-CUBLAS-009
rule: "Non-GEMM overhead does not increase after cuBLAS"
prediction: "T_non_gemm(cublas) < 1.1 * T_non_gemm(ptx)"
test: |
Profile 50 steps with PTX, measure total non-GEMM time.
Profile 50 steps with cuBLAS, measure total non-GEMM time.
Ratio must be < 1.10.
if_fails: "FP16 casting, handle creation, or workspace allocation adds overhead"
- id: FALSIFY-CUBLAS-010
rule: "GQA thin-matrix GEMM still benefits from cuBLAS"
prediction: "cuBLAS [4096, 256, 1024] > 50 TFLOP/s"
test: |
Run isolated GEMM on K/V projection shape [4096, 256, 1024].
Must exceed 50 TFLOP/s (lower bar than large shapes due to
low arithmetic intensity).
if_fails: "Thin matrices memory-bandwidth-bound, not compute-bound"
- id: FALSIFY-CUBLAS-011
rule: "cuBLAS column-major convention handled correctly"
prediction: "Row-major Rust buffers produce correct results via transpose flags"
test: |
Compute C = A * B in row-major (Rust native) using cuBLAS with
appropriate CUBLAS_OP_T flags. Compare against known-good reference.
All 7 GEMM shapes in a single transformer block must match.
if_fails: "Leading dimension or transpose convention wrong (ALB-059 class bug)"
# Add to training-step-budget-v1.yaml
- id: FALSIFY-BUDGET-004
rule: "Phase 0 baseline matches estimated breakdown"
prediction: "Measured GEMM fraction is 50-65% of step time"
test: |
Run BrickTracer profiling for 50 steps on PTX backend.
T_gemm / T_step must be in [0.50, 0.65].
if_fails: "Estimated breakdown is wrong; re-derive all phase projections"
# Add to training-memory-kernel-v1.yaml
- id: FALSIFY-MEM-004
rule: "Mixed-precision dual storage fits in VRAM"
prediction: "FP16 forward weights + FP32 master weights + optimizer < 24GB"
test: |
Compute: P * 2 (FP16 GPU) + P * 4 (FP32 CPU master, not on GPU)
+ P * 8 (AdamW m+v, on GPU) + workspace.
P=370M: 0.74 GB (FP16) + 2.96 GB (AdamW) + workspace = ~15.5 GB.
Must fit in 24 GB with seq=1024, batch=4.
if_fails: "VRAM budget exceeded, batch=4 may OOM with mixed precision"
Claim 11: “TF32 tensor cores provide ~2x throughput” (Section 6.9, Phase 5a)
- Status: FALSIFIED — REVERTED (ALB-076). TF32 tensor cores showed 0%
improvement at 350M model size (§6.9). More critically, tensor core GEMM
algorithms (
CUBLAS_GEMM_DEFAULT_TENSOR_OP) produce ALL NaN output for transposed backward GEMMs when gradient magnitudes reach ~1e5 (§6.12). - Root cause: cuBLAS tensor core algorithm has undocumented numerical failure mode with transposed operands at high magnitudes. Forward (NoTrans/NoTrans) is unaffected.
- Fix: Disabled tensor cores entirely (
CUBLAS_DEFAULT_MATH). cuBLAS SIMD path still 5.9x faster than PTX. Phase 5a reverted (trueno #170). - Action: Phase 5a removed from optimization path. Added to bug pattern catalog.
A.4 Unrealistic Assumptions Identified
| Assumption | Section | Reality Check |
|---|---|---|
| GEMM is 57% of step time | 2.4 | Unverified estimate. Phase 0 must confirm. |
| cuBLAS achieves 130-150 TFLOP/s | 4.1 | Depends on shape. May be 80-120 on rectangular. |
| Non-GEMM time stays constant | 6.1 | FP16 casting adds new overhead. |
| 2% FFI overhead | 5.7 | Plausible but requires per-GEMM vs per-step stream binding. |
| batch=8 fits with grad ckpt | 6.3 | Dual precision increases VRAM. Unproven. |
| 165 TFLOP/s is achievable peak | 1.2 | Marketing spec. Sustained is ~145-150 TFLOP/s. |
A.5 Recommended Spec Revisions
- Gate Phase 1 on Phase 0 completion. Do not write cuBLAS code until BrickTracer confirms the estimated breakdown.
- Add GQA thin-matrix shape
[4096, 256, 1024]to all benchmarks. - Add FALSIFY-CUBLAS-009 (non-GEMM overhead preservation).
- Add FALSIFY-CUBLAS-010 (thin-matrix performance floor).
- Add FALSIFY-CUBLAS-011 (column-major convention correctness).
- Add FALSIFY-BUDGET-004 (baseline confirmation gate).
- Add FALSIFY-MEM-004 (mixed-precision VRAM budget).
- Integrate bench-gemm-regression into CI (addresses batuta JA-04).
- Use sustained peak (~148 TFLOP/s) instead of marketing peak (165) for MFU calculations.
- Note set_stream() binding scope in cublas.rs contract: once per step, not per GEMM.