Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[0.14.2] - 2026-01-25

Fixed

macOS ARM64 Support - Fixed conditional compilation for cross-platform builds
- BLIS microkernels (AVX2/FMA) now properly gated with #[cfg(target_arch = "x86_64")]
- Q4K GEMV parallel function now properly gated for x86_64 only
- Fixes build failures on macOS ARM64 (aarch64-apple-darwin)
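
The gating described above can be illustrated with a minimal sketch. The function names here (dot, dot_impl) are hypothetical and are not taken from trueno; the x86_64 body is a plain-Rust stand-in for a real AVX2/FMA kernel.

```rust
/// x86_64-only path. In a real crate this is where an AVX2/FMA kernel built on
/// `std::arch::x86_64` intrinsics would live; a scalar body stands in for it here.
#[cfg(target_arch = "x86_64")]
fn dot_impl(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Portable fallback compiled on every other architecture (for example aarch64).
#[cfg(not(target_arch = "x86_64"))]
fn dot_impl(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Public entry point: exactly one `dot_impl` exists per target, so callers
/// never reference a symbol that was compiled out.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "slices must have equal length");
    dot_impl(a, b)
}
```

Because both variants share one name and signature, the call site needs no cfg attribute of its own, which is one way to avoid the kind of build failure fixed in this release.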

[0.14.1] - 2026-01-25

Quality

95% Velocity Mandate - Achieved 95%+ coverage on ALL individual files
- trueno: 98.40% overall coverage (2421 tests)
- trueno-gpu: 97.98% overall coverage (1873 tests)
- No file below 95% threshold

Ecosystem Updates

All trueno ecosystem crates updated to use trueno 0.14:

trueno-db v0.3.12
trueno-graph v0.1.12
trueno-rag v0.1.11
trueno-viz v0.1.21
trueno-explain v0.2.2 (trueno-gpu v0.4.11)

Infrastructure

Pre-commit hooks enforce 90% coverage threshold
All 4294 tests passing (2421 trueno + 1873 trueno-gpu)

[Unreleased]

Added

LZ4 Compression Kernel - GPU-accelerated LZ4 compression
- Lz4WarpCompressKernel: Warp-per-page architecture (32 threads per 4KB page)
- Lz4WarpDecompressKernel: Corresponding decompression kernel
- CPU reference implementation for testing (lz4_compress_block, lz4_decompress_block)
- Dual backend: NVIDIA PTX + WebGPU WGSL generation
- Zero-page detection with parallel OR reduction
- 200:1 compression ratio for zero pages, 15-30:1 for typical data
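
As a rough CPU-side model of the zero-page check listed above, the sketch below ORs a 4KB page into 32 per-lane partial results and then combines them with a tree reduction. It is an illustration only: is_zero_page and the constants are assumed names, and the actual kernels are generated as PTX/WGSL rather than written like this.

```rust
const PAGE_SIZE: usize = 4096; // one 4KB page per warp, as in the entry above
const WARP_SIZE: usize = 32;   // one lane per thread in the warp

/// Returns true when every byte of the page is zero.
pub fn is_zero_page(page: &[u8; PAGE_SIZE]) -> bool {
    // Step 1: each lane ORs together the 32-bit words it owns (strided access).
    let mut lanes = [0u32; WARP_SIZE];
    for (i, chunk) in page.chunks_exact(4).enumerate() {
        let word = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
        lanes[i % WARP_SIZE] |= word;
    }
    // Step 2: tree reduction across lanes, halving the active lanes each round
    // (on a GPU this would typically be done with warp shuffles).
    let mut stride = WARP_SIZE / 2;
    while stride > 0 {
        for lane in 0..stride {
            lanes[lane] |= lanes[lane + stride];
        }
        stride /= 2;
    }
    // Lane 0 now holds the OR of the whole page.
    lanes[0] == 0
}
```

A page that reduces to zero can be recorded with a tiny marker instead of compressed data, which is consistent with the much higher ratio quoted for zero pages.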

Documentation

Added LZ4 compression example (cargo run -p trueno-gpu --example lz4_compression)
Added LZ4 compression chapter to book (api-reference/lz4-compression.md)

[0.13.0] - 2026-01-16

Added

BLIS-Style Matrix Multiplication - High-performance GEMM achieving 71.5 GFLOP/s
- Hand-written ASM microkernel with 70%+ FMA utilization
- 5-loop algorithm with cache-optimized blocking (MC=72, KC=256, NC=4096)
- AVX2/AVX-512 SIMD backends with 4-deep software pipelining
- 32.9× speedup over reference implementation for 512×512 matrices
- Toyota Way integration: Jidoka guards, Heijunka scheduler, profiler
- 89 falsification tests covering F1-F55 Popperian criteria
BLIS Benchmark Example - cargo run --release --example blis_benchmark
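
To make the cache-blocking structure concrete, here is a simplified sketch using the MC/KC/NC values quoted above. It shows only the three outer blocking loops; the packing step and the two inner micro-tile loops of a full BLIS-style design are replaced by plain scalar loops. The function name gemm_blocked and the row-major layout are assumptions, not the crate's API.

```rust
const MC: usize = 72;   // rows of A per block
const KC: usize = 256;  // shared (inner) dimension per block
const NC: usize = 4096; // columns of B per block

/// C (m×n) += A (m×k) · B (k×n), all row-major.
pub fn gemm_blocked(m: usize, n: usize, k: usize, a: &[f32], b: &[f32], c: &mut [f32]) {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(c.len(), m * n);
    for jc in (0..n).step_by(NC) {
        let nc = NC.min(n - jc); // handle a partial block at the edge
        for pc in (0..k).step_by(KC) {
            let kc = KC.min(k - pc);
            for ic in (0..m).step_by(MC) {
                let mc = MC.min(m - ic);
                // Scalar stand-in for the packed micro-kernel loops.
                for i in ic..ic + mc {
                    for p in pc..pc + kc {
                        let a_ip = a[i * k + p];
                        for j in jc..jc + nc {
                            c[i * n + j] += a_ip * b[p * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

The blocking changes only the order in which products are accumulated, so the result matches a naive triple loop up to floating-point rounding.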

Documentation

Added BLIS-Style Matrix Multiplication chapter (advanced/blis-gemm.md)
Added comprehensive specification (docs/matrixmultiply-blis.md)

Improved

Test Coverage - 93.78% line coverage, 96.31% function coverage
Performance - 71.5 GFLOP/s peak (~18% of theoretical peak on modern x86_64)

[0.11.1] - 2026-01-04

Improved

Test Coverage - 94.10% → 94.40% line coverage
- PTX builder.rs: 87.88% → 91.04% (+30 tests for warp shuffle, bitwise ops, WMMA)
- PTX registers.rs: 90.42% → 99.57% (all special registers, live range tests)
- PTX types.rs: 97.75% → 99.01% (vector types V2F32/V4F32, all variants)
- Matrix: Added AVX-512 L3 blocking tests (520×520, 512×513, 517×512)
- Vector: Added backend-specific SIMD tests (Scalar, AVX-512)

Added

Matrix Index Trait - impl Index<(usize, usize)> for Matrix<f32>
- Tuple-based element access: matrix[(row, col)]
- Enables more ergonomic matrix element access
Property Testing - 47 PTX kernel property tests all passing
- GEMM, Softmax, LayerNorm, Attention, Batched GEMM
- Validates PTX structure across various dimensions
Mutation Testing - Infrastructure for PTX mutation testing
- Identifies weak test areas in PTX builder
- 322 mutants analyzed
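
The tuple-based access added in this release can be sketched as follows. The Matrix type here is a hypothetical, non-generic, row-major stand-in; the field names and the from_vec constructor are assumptions and may not match trueno's Matrix<f32>.

```rust
use std::ops::Index;

/// Hypothetical row-major matrix used only for this illustration.
pub struct Matrix {
    rows: usize,
    cols: usize,
    data: Vec<f32>,
}

impl Matrix {
    pub fn from_vec(rows: usize, cols: usize, data: Vec<f32>) -> Self {
        assert_eq!(data.len(), rows * cols, "data length must equal rows * cols");
        Self { rows, cols, data }
    }
}

impl Index<(usize, usize)> for Matrix {
    type Output = f32;

    /// Maps `matrix[(row, col)]` onto the flat row-major buffer.
    fn index(&self, (row, col): (usize, usize)) -> &f32 {
        assert!(row < self.rows && col < self.cols, "index out of bounds");
        &self.data[row * self.cols + col]
    }
}
```

With this in place, a 2×3 matrix built from the values 1 through 6 yields 6.0 for m[(1, 2)].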

Documentation

Updated README with coverage badge (94.4%)
Added crates.io version badge
Added trueno-gpu Pure Rust PTX section with code examples
Added benchmark results table (AMD Ryzen 9 7950X)
Expanded operations list

[0.11.0] - 2026-01-03

Added

TUI Logging - File-based logging for trueno-monitor
- Logs to ~/.trueno/monitor.log with daily rotation
- RUST_LOG=debug environment variable support
- Structured logging with tracing: startup, GPU detection, stress test results
Real Stress Testing - Uses trueno SIMD/CUDA compute paths
- CPU: 512×512 matrix multiply via AVX-512 (268M FLOPs/op)
- GPU: 4×256MB buffers saturating PCIe bandwidth (22.9 GB/s measured)
- CPU utilization during stress tests now reaches 100% (previously 10%)

Improved

AVX-512 Coverage - 83.9% → 93.6% line coverage
- Added SIMD path tests for: gelu, swish, tanh, log2, log10
- Tests use 32+ elements to exercise AVX-512 loops (16 elements/iter)
Overall Coverage - 91.8% → 94.0%
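
The reason the new tests need 32 or more elements can be seen in a scalar model of a 16-lane loop with a remainder tail. The function below is a hypothetical illustration, not trueno code: it doubles each element and reports how many full 16-element iterations ran.

```rust
const LANES: usize = 16; // f32 lanes in one 512-bit register

/// Doubles every element; returns the output and the number of full
/// 16-element iterations (the part a SIMD loop would execute).
pub fn double_chunked(input: &[f32]) -> (Vec<f32>, usize) {
    let mut out = Vec::with_capacity(input.len());
    let mut vector_iters = 0;
    let mut chunks = input.chunks_exact(LANES);
    for chunk in chunks.by_ref() {
        // Stand-in for one vector iteration over 16 lanes.
        out.extend(chunk.iter().map(|x| x * 2.0));
        vector_iters += 1;
    }
    // Scalar tail for the leftover elements.
    out.extend(chunks.remainder().iter().map(|x| x * 2.0));
    (out, vector_iters)
}
```

An input shorter than 16 elements never enters the vector loop at all, and 32 elements is the smallest size that runs it more than once, so smaller inputs would leave the loop-carried path untested.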

Fixed

Removed unused import in gpu_monitor_demo.rs
Added crate documentation to xtask (warning-free build)

[trueno-gpu 0.4.3] - 2026-01-01

Performance

PTX Emission Optimization - 20.9% improvement in PTX code generation
- Pre-allocated String capacity based on instruction count
- Zero-allocation write_instruction() writes directly to buffer
- Zero-allocation write_operand() and write_mem_operand() helpers
- Added Display impl for VirtualReg enabling write!() formatting
- Throughput: 68,316 kernels/sec
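
The emission pattern described above (reserve capacity once, then format straight into the buffer through a Display impl) might look roughly like this. The VirtualReg and AddInstr types, the emit function, and the 32-byte estimate are all assumptions made for the sketch; trueno-gpu's real types and helpers may differ.

```rust
use std::fmt::{self, Write};

/// Hypothetical virtual register, printed as `%r<N>`.
#[derive(Clone, Copy)]
pub struct VirtualReg(pub u32);

impl fmt::Display for VirtualReg {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "%r{}", self.0)
    }
}

/// Hypothetical three-operand integer add.
pub struct AddInstr {
    pub dst: VirtualReg,
    pub lhs: VirtualReg,
    pub rhs: VirtualReg,
}

/// Assumed average size of one emitted line, used only to size the buffer.
const BYTES_PER_INSTR: usize = 32;

pub fn emit(instrs: &[AddInstr]) -> String {
    // One up-front reservation based on the instruction count.
    let mut out = String::with_capacity(instrs.len() * BYTES_PER_INSTR);
    for i in instrs {
        // Formats directly into `out`; no temporary String per operand.
        writeln!(out, "    add.s32 {}, {}, {};", i.dst, i.lhs, i.rhs)
            .expect("writing to a String cannot fail");
    }
    out
}
```

The design point is that operands implement Display and are written into the shared buffer, instead of each being converted with to_string() and concatenated.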

Added

Kernel Generation Benchmark - New example bench_kernel_gen
- Benchmarks all kernel types: GEMM, Softmax, LayerNorm, Attention, Quantize
- Measures generation time, PTX size, and throughput
Performance Whitelist - PtxBugAnalyzer::with_performance_whitelist()
- Documents expected register pressure in high-performance kernels
- Whitelists Tensor Core, Attention, and Quantized kernel patterns
- Separates "expected performance tradeoffs" from actual bugs

Fixed

Barrier Safety Analyzer - Fixed false positives in quantized kernels
- Now recognizes *_done suffix labels as loop ends (not just *_end)
- Added explicit patterns: sb_loop_done, sub_block_done, k_block_done
- All 22 barrier safety tests pass
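
A minimal sketch of the suffix rule described above, assuming labels are compared as plain strings; is_loop_end_label is a hypothetical name, not the analyzer's actual function.

```rust
/// Treats a label as a loop end when it carries either recognized suffix.
pub fn is_loop_end_label(label: &str) -> bool {
    label.ends_with("_end") || label.ends_with("_done")
}
```

Under this rule sb_loop_done, sub_block_done, and k_block_done are all classified as loop ends, while a label such as k_block_start is not.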

[trueno-gpu 0.4.2] - 2026-01-01

Fixed

PARITY-114: Barrier Safety Bug - Fixed thread divergence causing CUDA error 700
- Root cause: Threads exiting early before bar.sync barriers caused remaining threads to hang
- Fixed 4 kernels: gemm_tensor_core, gemm_wmma_fp16, flash_attention, flash_attention_tensor_core
- Fix pattern: Predicated loads (store 0 first), bounds check AFTER loop, all threads participate in barriers

Added

Barrier Safety Analyzer - Static PTX analysis (PARITY-114 prevention)
- barrier_safety.rs - Detects early-exit-before-barrier patterns
- Kernel::analyze_barrier_safety() - Analyze any kernel for violations
- Kernel::emit_ptx_validated() - Production-ready PTX with safety check
- 19 barrier safety tests (9 analyzer + 10 kernel validation)
Boundary Condition Tests - Test dimensions not divisible by tile size
- GEMM: 17×17, 33×33, 100×100, single row/column
- Attention: seq_len=17, 33, 100
- Prevents future PARITY-114 regressions
CI Target - make barrier-safety for automated validation
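
The early-exit-before-barrier pattern can be illustrated with a deliberately simplified line scanner. It ignores control flow entirely and flags any ret or exit that appears textually before a later bar.sync, so it is far cruder than a real analyzer. The Violation type and find_early_exits function are hypothetical and do not reflect the barrier_safety.rs API.

```rust
#[derive(Debug, PartialEq)]
pub struct Violation {
    pub exit_line: usize,    // 1-based line of the early ret/exit
    pub barrier_line: usize, // 1-based line of the barrier it precedes
}

/// Linear scan over PTX text; reports each exit once, paired with the
/// first barrier that follows it.
pub fn find_early_exits(ptx: &str) -> Vec<Violation> {
    let mut pending_exits = Vec::new();
    let mut violations = Vec::new();
    for (idx, raw) in ptx.lines().enumerate() {
        let line = raw.trim();
        // Strip an optional predicate guard such as "@%p1 " or "@!%p1 ".
        let instr = if line.starts_with('@') {
            line.split_once(char::is_whitespace)
                .map(|(_, rest)| rest.trim_start())
                .unwrap_or("")
        } else {
            line
        };
        if instr.starts_with("ret;") || instr.starts_with("exit;") {
            pending_exits.push(idx + 1);
        } else if instr.starts_with("bar.sync") {
            for &exit_line in &pending_exits {
                violations.push(Violation { exit_line, barrier_line: idx + 1 });
            }
            pending_exits.clear();
        }
    }
    violations
}
```

A kernel whose only ret comes after its last barrier produces no violations, which matches the fix pattern of moving the bounds check after the loop so that all threads reach every barrier.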

Changed

Specification updated to v1.5.0 with 15 new falsification tests (§5.8)
Overall test count: 452 tests (up from 441)

[trueno-gpu 0.4.1] - 2026-01-01

Added

PTX Optimization Passes - NVIDIA CUDA Tile IR aligned (v1.4.0 spec)
- loop_split.rs - Loop splitting with profitability analysis (99.80% coverage)
- tko.rs - Token-Based Ordering for memory dependencies (94.29% coverage)
- Exported CmpOp and Operand in public API
- New example: ptx_optimize demonstrating all optimization passes
Book Chapter - PTX Optimization Passes
- FMA Fusion, Loop Splitting, TKO, Tile Validation documentation
- Academic references and NVIDIA CUDA Tile IR alignment

Changed

Overall test coverage: 94.28% (57 optimize module tests)

[trueno-gpu 0.4.0] - 2026-01-01

Fixed

WMMA Tensor Core Attention - Fixed four PTX bugs enabling Tensor Core attention on RTX 4090
- Register prefix conflict: B32 registers now use %rb prefix instead of %r
- Zero initialization: Use mov.f32 instead of loading from NULL pointer
- FP16 shared memory store: Use B16 type for 16-bit stores
- Address conversion: Added cvta.shared.u64 for WMMA generic pointer requirement
- Added Cvta operation to PtxOp enum for address space conversion

Added

Tensor Core Validation Tests - New kernel validation tests
- tensor_core_attention_ptx_structure - Verifies WMMA instructions and cvta.shared.u64
- tensor_core_attention_ptx_validate_with_ptxas - Validates PTX with NVIDIA ptxas

Performance

Tensor Core attention benchmarked on RTX 4090:
- 64×64: 8.7 GFLOPS (1.01× vs FP32)
- 256×64: 80.0 GFLOPS (1.06× vs FP32)
- 512×64: 202.5 GFLOPS (1.03× vs FP32)

[0.9.0] - 2025-12-31

Added

CUDA Tile GPU Optimizations - Major performance improvements for GPU kernels
TensorView and PartitionView - New abstractions for tiled reduction

[0.8.7] - 2025-12-16

Changed

Dependencies: Updated trueno-gpu to 0.2.2

[trueno-explain 0.2.0] - 2025-12-16

Added

PTX Bug Detection - Static analysis for PTX to catch common bugs
- 12 bug classes across 3 severity levels (P0 Critical, P1 High, P2 Medium)
- PtxBugAnalyzer with default, strict, and whitelist modes
- Detects: shared memory addressing bugs, missing barriers, register pressure, placeholder code, dead code, empty loops, missing bounds checks
- with_quantized_whitelist() for Q4K/Q5K/Q6K/Q8K kernels
- Coverage tracking with PtxCoverageTracker
Examples
- deep_bug_hunt - Analyze all trueno-gpu kernels (30 kernels)
- analyze_realizar - Analyze external hand-rolled PTX
- ptx_inspector - Deep dive into specific kernel PTX

Documentation

New chapter: PTX Bug Detection
190 new tests for bug detection

[trueno-gpu 0.2.2] - 2025-12-16

Changed

Internal: Reduced predicate pressure in tiled GEMM by using two branches instead of and_pred
No API changes

[0.7.3] - 2025-11-25

Added ✨

WebGPU for WASM (gpu-wasm feature)
- Cross-platform GPU compute: native and browser support
- Async-first API: all GPU operations have *_async variants
- Runtime detection via runtime::sync_available()
- Enables trueno-viz browser-based visualization
Cross-platform GPU API
- GpuDevice::new_async() - Works on all platforms
- All operations have async variants (relu_async, matmul_async, etc.)

Documentation 📚

Complete rewrite of GPU Backend chapter
Added WebGPU/WASM section to GPU Performance
trueno-viz integration examples

Fixed 🐛

Fixed type inference for empty slice comparisons
Corrected parameter naming in select_backend_for_operation

[0.7.1] - 2025-11-24

Added ✨

EXTREME PMAT Integration - O(1) Quality Gates for automated quality enforcement
Golden Trace Validation - Syscall-level performance regression detection with Renacer v0.6.2+
GPU Batch API Example - Demonstration of 3× transfer reduction for chained operations

Fixed 🐛

Replaced .unwrap() with .expect() in examples for better error messages
Corrected relative paths in golden-trace-validation.md documentation

Infrastructure 🔧

GitHub Actions workflow for automated golden trace validation
Enhanced gitignore for benchmark logs

Dependencies 📦

Updated all dependencies to latest versions (wgpu 27.0.1, criterion 0.7, thiserror 2.0.17)

Quality 🎯

Test coverage: 90.41% (exceeds 90% requirement)
942 tests passing (up from 936)
All quality gates passing
Pre-commit hooks enforce coverage threshold

[0.7.0] - 2025-11-22

Performance - Phase 3: Large Matrix Optimization 🚀

Achievement: 18% improvement for 1024×1024 matrices via 3-level cache blocking

3-level cache hierarchy (L3 → L2 → micro-kernel) for matrices ≥512×512
- L3 blocks: 256×256 (fits in 4-16MB L3 cache)
- L2 blocks: 64×64 (fits in 256KB L2 cache)
- Micro-kernel: 4×1 AVX2/FMA (register blocking)
- Smart threshold: Only activates for matrices ≥512×512
Zero-allocation implementation:
- No Vec allocations in hot path
- Blocking code is duplicated across if/else branches rather than shared through allocated buffers
- Preserves fast 2-level path for smaller matrices
Performance results:
- 1024×1024: 47.4 ms (18% faster than v0.6.0's 57.8 ms) ✅
- 512×512: ~5.3 ms (8.5% improvement)
- 256×256: No regression (uses 2-level path)
- Target: Within 1.5× of NumPy (currently 1.64×)
Testing:
- Added test_matmul_3level_blocking for 512×512 matrices
- 878 tests passing (all existing tests pass)
- Coverage: 90.41% (improved from 90.00%)
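
The block sizes and threshold above can be checked with a little arithmetic. The sketch below is an assumption-laden illustration: the names are hypothetical, and applying the 512 threshold to both dimensions is a guess at how non-square inputs are handled.

```rust
const L3_BLOCK: usize = 256;
const L2_BLOCK: usize = 64;
const THREE_LEVEL_THRESHOLD: usize = 512;

#[derive(Debug, PartialEq)]
pub enum Blocking {
    TwoLevel,
    ThreeLevel,
}

/// Picks the blocking strategy; small matrices keep the 2-level path.
pub fn select_blocking(rows: usize, cols: usize) -> Blocking {
    if rows >= THREE_LEVEL_THRESHOLD && cols >= THREE_LEVEL_THRESHOLD {
        Blocking::ThreeLevel
    } else {
        Blocking::TwoLevel
    }
}

/// Memory footprint in bytes of one square f32 block.
pub fn block_bytes(side: usize) -> usize {
    side * side * std::mem::size_of::<f32>()
}

/// Number of blocks needed to cover a dimension (rounded up).
pub fn blocks_per_dim(n: usize, block: usize) -> usize {
    (n + block - 1) / block
}
```

A 256×256 f32 block occupies 256 KiB and a 64×64 block occupies 16 KiB, so one block of each operand fits comfortably within the cache sizes quoted above; a 1024×1024 matrix is covered by 4 L3 blocks per dimension, each split into 4 L2 blocks per dimension.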

Quality & Testing

Test coverage: 90.26% (trueno library, exceeds 90% EXTREME TDD requirement)
Added 60+ new tests across xtask tooling and core library
Fixed clippy warnings (needless_range_loop)
Updated coverage policy: xtask (dev tooling) excluded from main coverage requirement
All quality gates passing: lint, format, tests, coverage

Documentation

Updated Phase 2 book chapter with 3-level blocking details
Added benchmark data for 512×512 and 1024×1024
GitHub issue #34 tracks Phase 3 progress

[0.6.0] - 2025-11-21

Performance - Phase 2: NumPy Performance Parity 🎯

Major Achievement: Pure Rust matches NumPy/OpenBLAS performance at 256×256 matrices

4×1 AVX2 micro-kernel implementation (Pure Rust, zero external dependencies)
- Fused Multiply-Add (FMA) instructions for 3× throughput
- Register blocking: 4 YMM accumulators stay in CPU registers
- Eliminates memory traffic, maximizes compute utilization
2-level cache blocking (outer loop: L2, inner loop: L1)
- Outer blocks: 64×64 (fits in L2 cache)
- Inner blocks: 4×4 (micro-kernel size, stays in registers)
- Adaptive based on matrix size
Performance results:
- 256×256: 7.3 ms (matches NumPy/OpenBLAS's 7.3 ms) ✅
- 128×128: 0.9 ms (vs NumPy 0.9 ms - parity achieved)
- 64×64: 0.12 ms (vs NumPy 0.12 ms - parity)
- Validates Phase 2 goal: pure Rust can match C/Fortran + assembly
Algorithm validation:
- Correctness: test_matmul_simd_equivalence_large with 100×100 matrices
- No regressions: All 843 tests passing
- Coverage: 90.00% (meets EXTREME TDD requirement)
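
The register-blocking idea can be modeled in portable Rust. This sketch reads "4×1" as four rows by one 8-lane YMM vector, which is an assumption, and it uses f32::mul_add as a scalar stand-in for the FMA instruction; the function name and data layout are likewise hypothetical.

```rust
const ROWS: usize = 4;  // four accumulators, one per output row
const LANES: usize = 8; // f32 lanes in one 256-bit YMM register

/// Computes a 4×8 output tile. `a` is 4×k row-major, `b` is k×8 row-major.
pub fn microkernel_4x1(k: usize, a: &[f32], b: &[f32]) -> [[f32; LANES]; ROWS] {
    assert_eq!(a.len(), ROWS * k);
    assert_eq!(b.len(), k * LANES);
    // The accumulators model four YMM registers that stay live for the whole
    // loop, so no partial sums are written back to memory.
    let mut acc = [[0.0f32; LANES]; ROWS];
    for p in 0..k {
        let b_row = &b[p * LANES..(p + 1) * LANES]; // one vector load of B
        for r in 0..ROWS {
            let a_rp = a[r * k + p]; // one scalar of A, broadcast to all lanes
            for l in 0..LANES {
                // Fused multiply-add: acc = a * b + acc in one operation.
                acc[r][l] = a_rp.mul_add(b_row[l], acc[r][l]);
            }
        }
    }
    acc
}
```

Each row of B is loaded once and reused by all four accumulators, which is where the reduction in memory traffic comes from.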

Documentation

Added Phase 2 book chapter documenting micro-kernel design
Updated performance benchmark tables with Phase 2 results
Added "Pragmatic Parity" definition to glossary

Earlier Releases

For earlier releases, see the CHANGELOG.md in the repository root.

Installation:

cargo add trueno

Links:

Trueno - High-Performance SIMD/GPU Compute Library