# Trueno GPU: Honest Acceleration Analysis

> **Toyota Way Principle (Genchi Genbutsu):** Go and see for yourself. Don't assume the GPU is faster; measure it.

**Status:** Complete
## The Promise vs. Reality of GPU Acceleration
GPU acceleration is marketed as a silver bullet for ML performance. The reality is more nuanced:
```text
GPU Acceleration: The Uncomfortable Truth
───────────────────────────────────────────────────────────────
"GPU is always faster"   → FALSE for small operations
"Just add GPU support"   → Transfer overhead matters
"CUDA solves everything" → Memory bandwidth is the limit

What really determines performance:
├─ Operation size (GPU needs scale)
├─ Memory transfer patterns (PCIe is slow)
├─ Parallelism (GPU needs thousands of independent ops)
└─ Your specific workload (always benchmark)
───────────────────────────────────────────────────────────────
```
## Validation

Run all chapter examples:

```bash
make run-ch07            # Run all examples
make run-ch07-gpu        # GPU acceleration concepts
make run-ch07-comparison # CPU vs GPU comparison
make test-ch07           # Run all tests
```
## GPU vs CPU Crossover Analysis

The critical question: at what size does the GPU become faster?
```text
Matrix Multiplication: CPU vs GPU (Simulated)
───────────────────────────────────────────────────────────────
   Size  │  CPU (ms)  │  GPU (ms)  │ Speedup  │ Winner
─────────┼────────────┼────────────┼──────────┼────────
  16×16  │      0.001 │      0.070 │    0.01x │ CPU
  32×32  │      0.005 │      0.070 │    0.07x │ CPU
  64×64  │      0.030 │      0.070 │    0.43x │ CPU
 128×128 │      0.200 │      0.070 │    2.86x │ GPU
 256×256 │      1.500 │      0.071 │    21.1x │ GPU
 512×512 │     12.000 │      0.075 │   160.0x │ GPU
───────────────────────────────────────────────────────────────
```
**Key insight:** GPU overhead dominates for small operations.
### GPU Overhead Breakdown

For a 32×32 matrix multiplication:

```rust
fn main() {
    // GPU time components (illustrative numbers)
    let transfer_time = 0.100; // data to GPU + results back (ms)
    let kernel_overhead = 0.020; // kernel launch, scheduling (ms)
    let compute_time = 0.001; // actual GPU computation (ms)

    let total_gpu = transfer_time + kernel_overhead + compute_time; // 0.121 ms
    let cpu_time = 0.005; // ms, scalar CPU matmul at this size
    println!("GPU is {:.0}x SLOWER at this size", total_gpu / cpu_time); // ~24x
}
```
The transfer overhead alone exceeds total CPU time for small operations.
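To see where the crossover comes from, here is a minimal cost-model sketch. This is our illustration rather than trueno's code, and the constants are only roughly calibrated to the example numbers above:

```rust
/// Illustrative cost model; constants roughly match this chapter's
/// example numbers, not any specific hardware.
fn gpu_time_ms(n: usize) -> f64 {
    let transfer = 0.100; // PCIe transfer, nearly size-independent at small n
    let launch = 0.020; // kernel launch + scheduling
    let compute = 1e-9 * (n * n * n) as f64; // massively parallel FLOPs
    transfer + launch + compute
}

fn cpu_time_ms(n: usize) -> f64 {
    // Scalar O(n^3) matmul, calibrated so 32×32 lands near 0.005 ms
    1.5e-7 * (n * n * n) as f64
}

fn main() {
    for n in [16, 32, 64, 128, 256, 512] {
        let (c, g) = (cpu_time_ms(n), gpu_time_ms(n));
        let winner = if c < g { "CPU" } else { "GPU" };
        println!("{n:>3}×{n}  CPU {c:.3} ms  GPU {g:.3} ms  → {winner}");
    }
}
```

Because the ~0.12 ms fixed overhead does not shrink with problem size, the GPU line stays flat until compute dominates; that is exactly the crossover the table shows between 64×64 and 128×128.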
## When GPU Actually Helps

GPU acceleration provides real benefits when:

### 1. Large Matrix Operations
```rust
fn main() {
    // 512×512 matrix multiplication, using the chapter's example functions
    let size = 512;
    let (cpu_time, _) = cpu_matmul(size); // ~12 ms
    let gpu_time = simulated_gpu_matmul(size); // ~0.075 ms
    println!("speedup: {:.0}x", cpu_time / gpu_time); // ~160x: clearly beneficial
}
```
### 2. Batch Processing
```rust
fn main() {
    // Process many small operations together.
    // Bad:  1000 separate GPU calls (overhead dominates)
    // Good: 1 batched GPU call covering all 1000 operations
    let batch_overhead = 0.1; // ms, fixed cost per GPU call
    let per_op_cost = 0.0001; // ms, tiny marginal cost per operation

    let batched = batch_overhead + 1000.0 * per_op_cost; // 0.2 ms
    let separate = 1000.0 * batch_overhead; // 100 ms
    println!("batching is {:.0}x faster", separate / batched); // 500x
}
```
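Structurally, batching means packing many small inputs into one contiguous buffer so that a single transfer and a single kernel launch cover all of them. A plain-Rust sketch of the packing step (a hypothetical helper, not trueno's API):

```rust
/// Hypothetical packing helper: flatten many small vectors into one
/// contiguous buffer so a single host→device copy moves all of them.
fn pack_batch(inputs: &[Vec<f32>]) -> (Vec<f32>, Vec<usize>) {
    let total: usize = inputs.iter().map(|v| v.len()).sum();
    let mut flat = Vec::with_capacity(total);
    let mut offsets = Vec::with_capacity(inputs.len() + 1);
    offsets.push(0);
    for v in inputs {
        flat.extend_from_slice(v);
        offsets.push(flat.len());
    }
    (flat, offsets) // transfer `flat` once; kernels locate slices via `offsets`
}
```

One transfer of `flat` replaces `inputs.len()` separate transfers; the offset table lets each GPU work item find its own slice.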
### 3. Parallel Element-wise Operations
```rust
fn main() {
    // ReLU on 1M elements
    let data: Vec<f32> = (0..1_000_000).map(|i| i as f32).collect();
    // GPU: all elements in parallel (one thread per element)
    // CPU: largely sequential; even with SIMD, parallelism is limited
    // Typical GPU speedup: 10-50x for large element-wise ops
    let _relu: Vec<f32> = data.iter().map(|&x| x.max(0.0)).collect();
}
```
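The same data-parallel shape can be emulated on the CPU. A minimal sketch using the rayon crate (our assumption; trueno's own parallel path may differ):

```rust
use rayon::prelude::*; // assumes `rayon = "1"` in Cargo.toml

/// ReLU with the exact structure a GPU kernel exploits: every element
/// is independent, so the work splits cleanly across all cores.
fn relu_parallel(data: &mut [f32]) {
    data.par_iter_mut().for_each(|x| *x = x.max(0.0));
}
```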
## GPU Failure Cases (Brutal Honesty)

### 1. Small Batches

- **Problem:** Transfer overhead exceeds compute time
- **Example:** 100-element vector operations
- **Result:** CPU is 10-100x faster
- **Solution:** Batch operations before the GPU transfer

### 2. Sequential Dependencies

- **Problem:** GPU excels at parallelism, not sequences
- **Example:** RNN with sequential state updates
- **Result:** GPU advantage shrinks to 2-3x at best
- **Solution:** Keep sequential logic on the CPU

### 3. Memory-Bound Operations

- **Problem:** GPU memory bandwidth is finite (~900 GB/s)
- **Example:** Simple vector addition (memory-bound, not compute-bound)
- **Result:** Speedup is capped by memory bandwidth, not compute
- **Solution:** Optimize data layout for coalesced access

### 4. Dynamic Control Flow

- **Problem:** GPU threads diverge on branches
- **Example:** Sparse operations with conditionals
- **Result:** Many GPU threads sit idle waiting for others
- **Solution:** Restructure as data-parallel, branch-free operations (see the sketch below)
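To make case 4 concrete, here is the restructuring idea in plain Rust (our example, not the chapter's source): replace a data-dependent branch with branch-free arithmetic, the same transformation that keeps GPU warps from diverging.

```rust
// Divergent: GPU threads in one warp that take different sides of a
// branch execute the two paths serially, with half the lanes idle.
fn clamp_branchy(x: f32, lo: f32, hi: f32) -> f32 {
    if x < lo { lo } else if x > hi { hi } else { x }
}

// Branch-free: every lane runs the identical instruction stream,
// so no thread waits for its neighbors.
fn clamp_branchless(x: f32, lo: f32, hi: f32) -> f32 {
    x.max(lo).min(hi)
}
```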
## CPU SIMD: The Underrated Alternative

trueno uses CPU SIMD for significant acceleration without GPU overhead:

```text
x86-64 (AVX2/AVX-512):
├─ AVX2: 256-bit vectors (8 × f32 per instruction)
├─ AVX-512: 512-bit vectors (16 × f32 per instruction)
└─ Available on most modern CPUs

ARM (NEON):
└─ 128-bit vectors (4 × f32 per instruction)

Advantages over GPU:
├─ Zero transfer overhead
├─ Lower latency for small operations
├─ Better cache utilization
└─ No GPU hardware required
```
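As a rough illustration of the idea (a sketch, not trueno's actual kernel), a dot product arranged in 8-wide chunks gives the compiler an easy auto-vectorization target matching the AVX2 lane count above:

```rust
/// Dot product with 8 independent accumulators, i.e. one AVX2 register's
/// worth of f32 lanes. Illustrative sketch written for auto-vectorization.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let split = a.len() - a.len() % 8;
    let (a8, a_tail) = a.split_at(split);
    let (b8, b_tail) = b.split_at(split);
    let mut acc = [0.0f32; 8];
    for (ca, cb) in a8.chunks_exact(8).zip(b8.chunks_exact(8)) {
        for i in 0..8 {
            acc[i] += ca[i] * cb[i]; // 8 multiply-adds per loop body
        }
    }
    let tail: f32 = a_tail.iter().zip(b_tail).map(|(x, y)| x * y).sum();
    acc.iter().sum::<f32>() + tail
}
```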
### SIMD vs GPU Comparison

```text
Operation: 10,000-element dot product
───────────────────────────────────────
CPU (scalar):    0.015 ms
CPU (SIMD):      0.003 ms (5x)
GPU (simulated): 0.050 ms

Winner: CPU SIMD

SIMD provides a ~16x speedup over the
GPU at this operation size
───────────────────────────────────────
```
## Decision Framework

Use this framework to decide between CPU and GPU:

```text
Decision Tree for GPU Acceleration
───────────────────────────────────────────────────────────────
1. Operation size < 10,000 elements?
   └─ YES → Use CPU (SIMD)
2. Operation is memory-bound (simple arithmetic)?
   └─ YES → Benchmark both; GPU may not help
3. Sequential dependencies?
   └─ YES → Keep on CPU
4. Can batch multiple operations?
   └─ NO → CPU likely wins
5. Size > 100,000 AND compute-bound AND parallelizable?
   └─ YES → GPU will likely help significantly
6. ALWAYS: Benchmark YOUR specific workload
───────────────────────────────────────────────────────────────
```
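The same tree can be encoded as code. A hypothetical sketch (the `Backend` type and thresholds are ours, taken from the rules of thumb above):

```rust
enum Backend {
    CpuSimd,
    Gpu,
    BenchmarkBoth,
}

/// Hypothetical encoding of the decision tree; the thresholds are
/// rules of thumb from this chapter, not measured constants.
fn choose_backend(
    elements: usize,
    memory_bound: bool,
    sequential_deps: bool,
    can_batch: bool,
) -> Backend {
    if elements < 10_000 || sequential_deps || !can_batch {
        Backend::CpuSimd
    } else if memory_bound {
        Backend::BenchmarkBoth // GPU may not help; measure first
    } else if elements > 100_000 {
        Backend::Gpu // large, compute-bound, parallelizable
    } else {
        Backend::BenchmarkBoth // gray zone: benchmark YOUR workload
    }
}
```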
## EU AI Act Compliance for GPU Operations

GPU operations must maintain compliance:

### Article 10: Data Governance
```rust
// GPU memory is isolated per process: no cross-tenant data leakage.
// Local execution only; no cloud GPU is required.
let local_gpu = GpuContext::new(device_id)?;
let result = local_gpu.execute(operation); // Data never leaves the machine
```
### Article 13: Transparency
```rust
// Deterministic GPU operations require:
// 1. Fixed random seeds
// 2. Deterministic reduction algorithms
// 3. A reproducible execution order
let config = GpuConfig {
    deterministic: true, // forces reproducible kernel behavior
    seed: 42,            // fixed seed for any randomness
};
```
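Point 2 is worth making concrete. Floating-point addition is not associative, so a reduction whose combination order varies between runs (as atomic GPU reductions often do) can change results in the last bits. A fixed-order pairwise reduction, sketched here in plain Rust as our illustration, is one standard way to keep results bit-for-bit reproducible:

```rust
/// Fixed-order pairwise (tree) reduction: the combination order depends
/// only on the input length, so repeated runs agree bit-for-bit.
/// Illustrative sketch, not trueno's GPU reduction.
fn pairwise_sum(xs: &[f32]) -> f32 {
    match xs.len() {
        0 => 0.0,
        1 => xs[0],
        n => {
            let (lo, hi) = xs.split_at(n / 2);
            pairwise_sum(lo) + pairwise_sum(hi)
        }
    }
}
```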
### Article 15: Robustness
```rust
// Graceful CPU fallback on GPU failure
fn execute_with_fallback(op: Operation) -> Result<Tensor> {
    match gpu_execute(&op) {
        Ok(result) => Ok(result),
        Err(GpuError::OutOfMemory) => {
            log::warn!("GPU OOM, falling back to CPU");
            cpu_execute(&op) // deterministic CPU fallback
        }
        Err(e) => Err(e.into()),
    }
}
```
## Testing GPU Code

```rust
#[test]
fn test_gpu_beats_cpu_at_scale() {
    let size = 512;
    let (cpu_time, _) = cpu_matmul(size);
    let gpu_time = simulated_gpu_matmul(size);
    assert!(
        gpu_time < cpu_time,
        "GPU should be faster for 512×512 matrices"
    );
}

#[test]
fn test_matmul_determinism() {
    let (_, result1) = cpu_matmul(32);
    let (_, result2) = cpu_matmul(32);
    assert_eq!(
        result1, result2,
        "Matrix multiplication must be deterministic"
    );
}
```
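Unit tests check the direction of the result; the actual crossover numbers call for a benchmark harness. A minimal sketch using the criterion crate (our assumption; the chapter's Makefile targets may wrap a different harness), sweeping the chapter's `cpu_matmul` across the table's sizes:

```rust
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

fn bench_matmul(c: &mut Criterion) {
    let mut group = c.benchmark_group("matmul");
    for size in [16, 32, 64, 128, 256, 512] {
        group.bench_with_input(BenchmarkId::from_parameter(size), &size, |b, &s| {
            b.iter(|| cpu_matmul(s)); // the chapter's CPU baseline
        });
    }
    group.finish();
}

criterion_group!(benches, bench_matmul);
criterion_main!(benches);
```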
## Performance Summary

| Workload    | Elements | CPU SIMD | GPU      | Winner |
|-------------|----------|----------|----------|--------|
| Dot product | 1K       | 0.001 ms | 0.05 ms  | CPU    |
| Dot product | 1M       | 1.0 ms   | 0.1 ms   | GPU    |
| Matrix mult | 64×64    | 0.03 ms  | 0.07 ms  | CPU    |
| Matrix mult | 512×512  | 12 ms    | 0.075 ms | GPU    |
| ReLU        | 10K      | 0.01 ms  | 0.05 ms  | CPU    |
| ReLU        | 1M       | 0.5 ms   | 0.06 ms  | GPU    |
## Key Takeaways

- **GPU is not magic:** transfer overhead matters
- **Size determines the winner:** <10K elements → CPU; >100K → GPU
- **CPU SIMD is underrated:** 5-10x speedup with zero transfer overhead
- **Always benchmark:** your workload is unique
- **Batch for GPU:** amortize the fixed overhead across many operations
## Next Steps
- Chapter 8: aprender ML training with GPU-accelerated backpropagation
- Chapter 9: realizar inference with optimized GPU kernels
- Chapter 10: trueno-db with GPU-accelerated vector search
## Source Code

Full implementation: `examples/ch07-trueno-gpu/`

```bash
# Verify all claims
make test-ch07

# Run examples
make run-ch07
```