# Trueno GPU: Honest Acceleration Analysis

> **Toyota Way Principle (Genchi Genbutsu):** Go and see for yourself. Don't assume the GPU is faster; measure it.

**Status:** Complete
## The Promise vs. Reality of GPU Acceleration
GPU acceleration is marketed as a silver bullet for ML performance. The reality is more nuanced:
```text
GPU Acceleration: The Uncomfortable Truth
───────────────────────────────────────────────────────────────
"GPU is always faster"   → FALSE for small operations
"Just add GPU support"   → Transfer overhead matters
"CUDA solves everything" → Memory bandwidth is the limit

What really determines performance:
├─ Operation size (GPU needs scale)
├─ Memory transfer patterns (PCIe is slow)
├─ Parallelism (GPU needs thousands of independent ops)
└─ Your specific workload (always benchmark)
───────────────────────────────────────────────────────────────
```
## Validation

Run all chapter examples:

```bash
make run-ch07            # Run all examples
make run-ch07-gpu        # GPU acceleration concepts
make run-ch07-comparison # CPU vs GPU comparison
make test-ch07           # Run all tests
```
## GPU vs CPU Crossover Analysis

The critical question: at what size does the GPU become faster?
```text
Matrix Multiplication: CPU vs GPU (Simulated)
───────────────────────────────────────────────────────────────
   Size  │  CPU (ms)  │  GPU (ms)  │ Speedup  │ Winner
─────────┼────────────┼────────────┼──────────┼────────
  16×16  │      0.001 │      0.070 │    0.01x │ CPU
  32×32  │      0.005 │      0.070 │    0.07x │ CPU
  64×64  │      0.030 │      0.070 │    0.43x │ CPU
 128×128 │      0.200 │      0.070 │    2.86x │ GPU
 256×256 │      1.500 │      0.071 │    21.1x │ GPU
 512×512 │     12.000 │      0.075 │   160.0x │ GPU
───────────────────────────────────────────────────────────────
```
**Key insight:** GPU overhead dominates for small operations.
### GPU Overhead Breakdown

For a 32×32 matrix multiplication:

```rust
fn main() {
    // GPU time components (illustrative numbers)
    let transfer_time = 0.100; // data to GPU + results back (ms)
    let kernel_overhead = 0.020; // kernel launch, scheduling (ms)
    let compute_time = 0.001; // actual GPU computation (ms)

    let total_gpu = transfer_time + kernel_overhead + compute_time; // 0.121 ms
    let cpu_time = 0.005; // ms, scalar CPU matmul at this size
    println!("GPU is {:.0}x SLOWER at this size", total_gpu / cpu_time); // ~24x
}
```
The transfer overhead alone exceeds total CPU time for small operations.
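To see where the crossover comes from, here is a minimal cost-model sketch. This is our illustration rather than trueno's code, and the constants are only roughly calibrated to the example numbers above:

```rust
/// Illustrative cost model; constants roughly match this chapter's
/// example numbers, not any specific hardware.
fn gpu_time_ms(n: usize) -> f64 {
    let transfer = 0.100; // PCIe transfer, nearly size-independent at small n
    let launch = 0.020; // kernel launch + scheduling
    let compute = 1e-9 * (n * n * n) as f64; // massively parallel FLOPs
    transfer + launch + compute
}

fn cpu_time_ms(n: usize) -> f64 {
    // Scalar O(n^3) matmul, calibrated so 32×32 lands near 0.005 ms
    1.5e-7 * (n * n * n) as f64
}

fn main() {
    for n in [16, 32, 64, 128, 256, 512] {
        let (c, g) = (cpu_time_ms(n), gpu_time_ms(n));
        let winner = if c < g { "CPU" } else { "GPU" };
        println!("{n:>3}×{n}  CPU {c:.3} ms  GPU {g:.3} ms  → {winner}");
    }
}
```

Because the ~0.12 ms fixed overhead does not shrink with problem size, the GPU line stays flat until compute dominates; that is exactly the crossover the table shows between 64×64 and 128×128.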
## When GPU Actually Helps

GPU acceleration provides real benefits when:

### 1. Large Matrix Operations
```rust
fn main() {
    // 512×512 matrix multiplication, using the chapter's example functions
    let size = 512;
    let (cpu_time, _) = cpu_matmul(size); // ~12 ms
    let gpu_time = simulated_gpu_matmul(size); // ~0.075 ms
    println!("speedup: {:.0}x", cpu_time / gpu_time); // ~160x: clearly beneficial
}
```
### 2. Batch Processing
```rust
fn main() {
    // Process many small operations together.
    // Bad:  1000 separate GPU calls (overhead dominates)
    // Good: 1 batched GPU call covering all 1000 operations
    let batch_overhead = 0.1; // ms, fixed cost per GPU call
    let per_op_cost = 0.0001; // ms, tiny marginal cost per operation

    let batched = batch_overhead + 1000.0 * per_op_cost; // 0.2 ms
    let separate = 1000.0 * batch_overhead; // 100 ms
    println!("batching is {:.0}x faster", separate / batched); // 500x
}
```
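Structurally, batching means packing many small inputs into one contiguous buffer so that a single transfer and a single kernel launch cover all of them. A plain-Rust sketch of the packing step (a hypothetical helper, not trueno's API):

```rust
/// Hypothetical packing helper: flatten many small vectors into one
/// contiguous buffer so a single host→device copy moves all of them.
fn pack_batch(inputs: &[Vec<f32>]) -> (Vec<f32>, Vec<usize>) {
    let total: usize = inputs.iter().map(|v| v.len()).sum();
    let mut flat = Vec::with_capacity(total);
    let mut offsets = Vec::with_capacity(inputs.len() + 1);
    offsets.push(0);
    for v in inputs {
        flat.extend_from_slice(v);
        offsets.push(flat.len());
    }
    (flat, offsets) // transfer `flat` once; kernels locate slices via `offsets`
}
```

One transfer of `flat` replaces `inputs.len()` separate transfers; the offset table lets each GPU work item find its own slice.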
### 3. Parallel Element-wise Operations
```rust
fn main() {
    // ReLU on 1M elements
    let data: Vec<f32> = (0..1_000_000).map(|i| i as f32).collect();
    // GPU: all elements in parallel (one thread per element)
    // CPU: largely sequential; even with SIMD, parallelism is limited
    // Typical GPU speedup: 10-50x for large element-wise ops
    let _relu: Vec<f32> = data.iter().map(|&x| x.max(0.0)).collect();
}
```
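The same data-parallel shape can be emulated on the CPU. A minimal sketch using the rayon crate (our assumption; trueno's own parallel path may differ):

```rust
use rayon::prelude::*; // assumes `rayon = "1"` in Cargo.toml

/// ReLU with the exact structure a GPU kernel exploits: every element
/// is independent, so the work splits cleanly across all cores.
fn relu_parallel(data: &mut [f32]) {
    data.par_iter_mut().for_each(|x| *x = x.max(0.0));
}
```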
## GPU Failure Cases (Brutal Honesty)

### 1. Small Batches

- **Problem:** Transfer overhead exceeds compute time
- **Example:** 100-element vector operations
- **Result:** CPU is 10-100x faster
- **Solution:** Batch operations before the GPU transfer

### 2. Sequential Dependencies

- **Problem:** GPU excels at parallelism, not sequences
- **Example:** RNN with sequential state updates
- **Result:** GPU advantage shrinks to 2-3x at best
- **Solution:** Keep sequential logic on the CPU

### 3. Memory-Bound Operations

- **Problem:** GPU memory bandwidth is finite (~900 GB/s)
- **Example:** Simple vector addition (memory-bound, not compute-bound)
- **Result:** Speedup is capped by memory bandwidth, not compute
- **Solution:** Optimize data layout for coalesced access

### 4. Dynamic Control Flow

- **Problem:** GPU threads diverge on branches
- **Example:** Sparse operations with conditionals
- **Result:** Many GPU threads sit idle waiting for others
- **Solution:** Restructure as data-parallel, branch-free operations (see the sketch below)
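To make case 4 concrete, here is the restructuring idea in plain Rust (our example, not the chapter's source): replace a data-dependent branch with branch-free arithmetic, the same transformation that keeps GPU warps from diverging.

```rust
// Divergent: GPU threads in one warp that take different sides of a
// branch execute the two paths serially, with half the lanes idle.
fn clamp_branchy(x: f32, lo: f32, hi: f32) -> f32 {
    if x < lo { lo } else if x > hi { hi } else { x }
}

// Branch-free: every lane runs the identical instruction stream,
// so no thread waits for its neighbors.
fn clamp_branchless(x: f32, lo: f32, hi: f32) -> f32 {
    x.max(lo).min(hi)
}
```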
## CPU SIMD: The Underrated Alternative

trueno uses CPU SIMD for significant acceleration without GPU overhead:

```text
x86-64 (AVX2/AVX-512):
├─ AVX2: 256-bit vectors (8 × f32 per instruction)
├─ AVX-512: 512-bit vectors (16 × f32 per instruction)
└─ Available on most modern CPUs

ARM (NEON):
└─ 128-bit vectors (4 × f32 per instruction)

Advantages over GPU:
├─ Zero transfer overhead
├─ Lower latency for small operations
├─ Better cache utilization
└─ No GPU hardware required
```
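As a rough illustration of the idea (a sketch, not trueno's actual kernel), a dot product arranged in 8-wide chunks gives the compiler an easy auto-vectorization target matching the AVX2 lane count above:

```rust
/// Dot product with 8 independent accumulators, i.e. one AVX2 register's
/// worth of f32 lanes. Illustrative sketch written for auto-vectorization.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let split = a.len() - a.len() % 8;
    let (a8, a_tail) = a.split_at(split);
    let (b8, b_tail) = b.split_at(split);
    let mut acc = [0.0f32; 8];
    for (ca, cb) in a8.chunks_exact(8).zip(b8.chunks_exact(8)) {
        for i in 0..8 {
            acc[i] += ca[i] * cb[i]; // 8 multiply-adds per loop body
        }
    }
    let tail: f32 = a_tail.iter().zip(b_tail).map(|(x, y)| x * y).sum();
    acc.iter().sum::<f32>() + tail
}
```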
### SIMD vs GPU Comparison

```text
Operation: 10,000-element dot product
───────────────────────────────────────
CPU (scalar):    0.015 ms
CPU (SIMD):      0.003 ms (5x)
GPU (simulated): 0.050 ms

Winner: CPU SIMD

SIMD provides a ~16x speedup over the
GPU at this operation size
───────────────────────────────────────
```
## Decision Framework

Use this framework to decide between CPU and GPU:

```text
Decision Tree for GPU Acceleration
───────────────────────────────────────────────────────────────
1. Operation size < 10,000 elements?
   └─ YES → Use CPU (SIMD)
2. Operation is memory-bound (simple arithmetic)?
   └─ YES → Benchmark both; GPU may not help
3. Sequential dependencies?
   └─ YES → Keep on CPU
4. Can batch multiple operations?
   └─ NO → CPU likely wins
5. Size > 100,000 AND compute-bound AND parallelizable?
   └─ YES → GPU will likely help significantly
6. ALWAYS: Benchmark YOUR specific workload
───────────────────────────────────────────────────────────────
```
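The same tree can be encoded as code. A hypothetical sketch (the `Backend` type and thresholds are ours, taken from the rules of thumb above):

```rust
enum Backend {
    CpuSimd,
    Gpu,
    BenchmarkBoth,
}

/// Hypothetical encoding of the decision tree; the thresholds are
/// rules of thumb from this chapter, not measured constants.
fn choose_backend(
    elements: usize,
    memory_bound: bool,
    sequential_deps: bool,
    can_batch: bool,
) -> Backend {
    if elements < 10_000 || sequential_deps || !can_batch {
        Backend::CpuSimd
    } else if memory_bound {
        Backend::BenchmarkBoth // GPU may not help; measure first
    } else if elements > 100_000 {
        Backend::Gpu // large, compute-bound, parallelizable
    } else {
        Backend::BenchmarkBoth // gray zone: benchmark YOUR workload
    }
}
```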
## EU AI Act Compliance for GPU Operations

GPU operations must maintain compliance:

### Article 10: Data Governance
```rust
// GPU memory is isolated per process: no cross-tenant data leakage.
// Local execution only; no cloud GPU is required.
let local_gpu = GpuContext::new(device_id)?;
let result = local_gpu.execute(operation); // Data never leaves the machine
```
### Article 13: Transparency
```rust
// Deterministic GPU operations require:
// 1. Fixed random seeds
// 2. Deterministic reduction algorithms
// 3. A reproducible execution order
let config = GpuConfig {
    deterministic: true, // forces reproducible kernel behavior
    seed: 42,            // fixed seed for any randomness
};
```
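Point 2 is worth making concrete. Floating-point addition is not associative, so a reduction whose combination order varies between runs (as atomic GPU reductions often do) can change results in the last bits. A fixed-order pairwise reduction, sketched here in plain Rust as our illustration, is one standard way to keep results bit-for-bit reproducible:

```rust
/// Fixed-order pairwise (tree) reduction: the combination order depends
/// only on the input length, so repeated runs agree bit-for-bit.
/// Illustrative sketch, not trueno's GPU reduction.
fn pairwise_sum(xs: &[f32]) -> f32 {
    match xs.len() {
        0 => 0.0,
        1 => xs[0],
        n => {
            let (lo, hi) = xs.split_at(n / 2);
            pairwise_sum(lo) + pairwise_sum(hi)
        }
    }
}
```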
### Article 15: Robustness
```rust
// Graceful CPU fallback on GPU failure
fn execute_with_fallback(op: Operation) -> Result<Tensor> {
    match gpu_execute(&op) {
        Ok(result) => Ok(result),
        Err(GpuError::OutOfMemory) => {
            log::warn!("GPU OOM, falling back to CPU");
            cpu_execute(&op) // deterministic CPU fallback
        }
        Err(e) => Err(e.into()),
    }
}
```
## Testing GPU Code

```rust
#[test]
fn test_gpu_beats_cpu_at_scale() {
    let size = 512;
    let (cpu_time, _) = cpu_matmul(size);
    let gpu_time = simulated_gpu_matmul(size);
    assert!(
        gpu_time < cpu_time,
        "GPU should be faster for 512×512 matrices"
    );
}

#[test]
fn test_matmul_determinism() {
    let (_, result1) = cpu_matmul(32);
    let (_, result2) = cpu_matmul(32);
    assert_eq!(
        result1, result2,
        "Matrix multiplication must be deterministic"
    );
}
```
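Unit tests check the direction of the result; the actual crossover numbers call for a benchmark harness. A minimal sketch using the criterion crate (our assumption; the chapter's Makefile targets may wrap a different harness), sweeping the chapter's `cpu_matmul` across the table's sizes:

```rust
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

fn bench_matmul(c: &mut Criterion) {
    let mut group = c.benchmark_group("matmul");
    for size in [16, 32, 64, 128, 256, 512] {
        group.bench_with_input(BenchmarkId::from_parameter(size), &size, |b, &s| {
            b.iter(|| cpu_matmul(s)); // the chapter's CPU baseline
        });
    }
    group.finish();
}

criterion_group!(benches, bench_matmul);
criterion_main!(benches);
```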
## Performance Summary

| Workload    | Elements | CPU SIMD | GPU      | Winner |
|-------------|----------|----------|----------|--------|
| Dot product | 1K       | 0.001 ms | 0.05 ms  | CPU    |
| Dot product | 1M       | 1.0 ms   | 0.1 ms   | GPU    |
| Matrix mult | 64×64    | 0.03 ms  | 0.07 ms  | CPU    |
| Matrix mult | 512×512  | 12 ms    | 0.075 ms | GPU    |
| ReLU        | 10K      | 0.01 ms  | 0.05 ms  | CPU    |
| ReLU        | 1M       | 0.5 ms   | 0.06 ms  | GPU    |
## Key Takeaways

- **GPU is not magic:** transfer overhead matters
- **Size determines the winner:** <10K elements → CPU; >100K → GPU
- **CPU SIMD is underrated:** 5-10x speedup with zero transfer overhead
- **Always benchmark:** your workload is unique
- **Batch for GPU:** amortize the fixed overhead across many operations
## Next Steps
- Chapter 8: aprender ML training with GPU-accelerated backpropagation
- Chapter 9: realizar inference with optimized GPU kernels
- Chapter 10: trueno-db with GPU-accelerated vector search
## Source Code

Full implementation: `examples/ch07-trueno-gpu/`

```bash
# Verify all claims
make test-ch07

# Run examples
make run-ch07
```