Introduction

Trueno (Spanish: "thunder") is a high-performance Rust library providing unified compute primitives across three execution targets: CPU SIMD, GPU, and WebAssembly. The name reflects the library's mission: to deliver thunderous performance through intelligent hardware acceleration.

The Problem: Performance vs Portability

Modern applications face a critical tradeoff:

Hand-optimized assembly: Maximum performance (2-50x speedup), but unmaintainable and platform-specific
Portable high-level code: Easy to write and maintain, but leaves performance on the table
Unsafe SIMD intrinsics: Good performance, but riddled with unsafe code and platform-specific complexity

Traditional approaches force you to choose between performance, safety, and portability. Trueno chooses all three.

The Solution: Write Once, Optimize Everywhere

Trueno's core philosophy is write once, optimize everywhere:

use trueno::Vector;

// Single API call, multiple backend implementations
let a = Vector::from_slice(&[1.0, 2.0, 3.0, 4.0]);
let b = Vector::from_slice(&[5.0, 6.0, 7.0, 8.0]);
let result = a.add(&b)?;

// Automatically selects best backend:
// - AVX2 on modern Intel/AMD (4-8x speedup)
// - NEON on ARM64 (2-4x speedup)
// - GPU for large workloads (10-50x speedup)
// - WASM SIMD128 in browsers (2x speedup)

Key Features

1. Multi-Target Execution

Trueno runs on three execution targets with a unified API:

Target	Backends	Use Cases
CPU SIMD	SSE2, AVX, AVX2, AVX-512 (x86) NEON (ARM) SIMD128 (WASM)	General-purpose compute, small to medium workloads
GPU	CUDA (NVIDIA via trueno-gpu) Vulkan, Metal, DX12, WebGPU via `wgpu`	Large workloads (100K+ elements), parallel operations
WebAssembly	SIMD128 portable	Browser/edge deployment, serverless functions

2. Runtime Backend Selection

Trueno automatically selects the best available backend at runtime:

┌─────────────────────────────────────────────────┐
│           Trueno Public API (Safe)              │
│  compute(), map(), reduce(), transform()        │
└─────────────────────────────────────────────────┘
                      │
        ┌─────────────┼─────────────┐
        ▼             ▼             ▼
   ┌────────┐   ┌─────────┐   ┌──────────┐
   │  SIMD  │   │   GPU   │   │   WASM   │
   │ Backend│   │ Backend │   │  Backend │
   └────────┘   └─────────┘   └──────────┘
        │             │             │
   ┌────┴────┐   ┌────┴────┐   ┌───┴─────┐
   │ Runtime │   │CUDA/wgpu│   │ SIMD128 │
   │ Detect  │   │ Compute │   │ Portable│
   └─────────┘   └─────────┘   └─────────┘
   │  │  │  │       │
   SSE2 AVX NEON   PTX
      AVX512     (trueno-gpu)

Backend Selection Priority:

GPU (if available + workload > 100K elements)
AVX-512 (if CPU supports)
AVX2 (if CPU supports)
AVX (if CPU supports)
SSE2 (baseline x86_64)
NEON (ARM64)
SIMD128 (WASM)
Scalar fallback

3. Zero Unsafe in Public API

All unsafe code is isolated to backend implementations:

// ✅ SAFE public API
pub fn add(&self, other: &Self) -> Result<Self> {
    // Safe bounds checking, validation
    if self.len() != other.len() {
        return Err(TruenoError::SizeMismatch { ... });
    }

    // ❌ UNSAFE internal implementation (isolated)
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx2") {
        unsafe { self.add_avx2(other) }
    } else {
        self.add_scalar(other) // Safe fallback
    }
}

Safety guarantees:

Public API is 100% safe Rust
All bounds checked before dispatching to backends
Miri validation for undefined behavior
286 documented SAFETY invariants in backend code

4. Proven Performance

Trueno delivers 2-50x speedups over scalar code:

Operation	Size	Scalar	SSE2	AVX2	AVX-512	GPU
`add_f32`	1K	1.0x	2.1x	4.3x	8.2x	-
`add_f32`	100K	1.0x	2.0x	4.1x	8.0x	3.2x
`add_f32`	1M	1.0x	2.0x	4.0x	7.9x	12.5x
`dot_product`	1M	1.0x	3.1x	6.2x	12.1x	18.7x

All benchmarks validated with:

Coefficient of variation < 5%
100+ iterations for statistical significance
No regressions > 5% vs baseline

5. Extreme TDD Quality

Trueno is built with EXTREME TDD methodology:

>90% test coverage (verified with cargo llvm-cov)
Property-based testing (commutativity, associativity, distributivity)
Backend equivalence tests (scalar vs SIMD vs GPU produce identical results)
Mutation testing (>80% mutation kill rate with cargo mutants)
Zero tolerance for defects (all quality gates must pass)

Real-World Impact: The FFmpeg Case Study

FFmpeg (the world's most-used video codec library) contains:

390 assembly files (~180,000 lines, 11% of codebase)
Platform-specific implementations for x86, ARM, MIPS, PowerPC
Speedups: SSE2 (2-4x), AVX2 (4-8x), AVX-512 (8-16x)

Problems with hand-written assembly:

❌ Unsafe (raw pointers, no bounds checking)
❌ Unmaintainable (390 files, must update all platforms)
❌ Non-portable (separate implementations per CPU)
❌ Expertise barrier (requires assembly knowledge)

Trueno's value proposition:

✅ Safety: Zero unsafe in public API
✅ Portability: Single source → x86/ARM/WASM/GPU
✅ Performance: 85-95% of hand-tuned assembly
✅ Maintainability: Rust type system catches errors at compile time

Who Should Use Trueno?

Trueno is designed for:

ML/AI Engineers - NumPy-like compute primitives for Rust (use with aprender for training)
Systems Programmers - Eliminate unsafe SIMD intrinsics
Game Developers - Fast vector math for physics/graphics
Scientific Computing - High-performance numerical operations
WebAssembly Developers - Portable SIMD for browsers/edge
Transpiler Authors - Safe SIMD target for Depyler/Decy/Ruchy

Design Principles

Trueno follows five core principles:

Write once, optimize everywhere - Single algorithm, multiple backends
Safety via type system - Zero unsafe in public API
Performance must be proven - Every optimization validated with benchmarks (≥10% speedup)
Extreme TDD - >90% coverage, mutation testing, property-based tests
Toyota Way - Kaizen (continuous improvement), Jidoka (built-in quality)

What's Next?

Getting Started - Install Trueno and run your first program
Architecture - Understand the multi-backend design
API Reference - Explore available operations
Performance - See benchmark results and optimization techniques
Examples - Learn from real-world use cases

Project Status

Trueno is under active development at Pragmatic AI Labs:

Current Version: 0.1.0 (Phase 1: Vector operations)
License: MIT/Apache-2.0 dual-licensed
Repository: github.com/paiml/trueno
Issues: github.com/paiml/trueno/issues

Scope:

Trueno: Compute primitives (vectors, matrices, SIMD, GPU) - NumPy equivalent
Aprender: ML framework with autograd and training - PyTorch equivalent

Trueno is the compute backend for higher-level ML libraries. For neural networks and training, see aprender.

Join us in building the future of safe, high-performance compute!

Trueno - High-Performance SIMD/GPU Compute Library