GPU Acceleration
GPU acceleration is the highest tier of the MoE backend selection in Phase 3. Batuta uses the wgpu crate (via Trueno) for portable GPU compute across Vulkan, Metal, DX12, and WebGPU.
The 5x PCIe Dispatch Rule
GPU dispatch incurs overhead from data transfer across the PCIe bus. Based on Gregg and Hazelwood (2011), GPU compute is only beneficial when:
compute_time > 5 * transfer_time
The BackendSelector implements this as a cost model:
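impl BackendSelector {
    /// Chooses the GPU only when estimated compute time exceeds
    /// `min_dispatch_ratio` (5 by default) times the PCIe transfer time.
    pub fn select_backend(&self, data_bytes: usize, flops: u64) -> Backend {
        // Seconds spent moving the data across PCIe (bandwidth in bytes/s).
        let transfer_s = data_bytes as f64 / self.pcie_bandwidth;
        // Seconds of GPU compute (throughput in FLOP/s, matching the 20 TFLOPS default).
        let compute_s = flops as f64 / self.gpu_gflops;
        if compute_s > self.min_dispatch_ratio * transfer_s {
            Backend::GPU
        } else {
            Backend::SIMD
        }
    }
}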
Default parameters assume PCIe 4.0 x16 (32 GB/s) and A100-class throughput (20 TFLOPS).
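To make the defaults concrete, the sketch below re-implements the cost model as a standalone type and derives the break-even arithmetic intensity it implies. `CostModel`, `gpu_pays_off`, and `break_even_flops_per_byte` are hypothetical names used for illustration, not part of Batuta's API; the numeric defaults are the ones stated above, expressed in bytes/s and FLOP/s.

```rust
/// Hypothetical stand-in for the selector's cost parameters (not Batuta's real type).
#[derive(Debug, Clone, Copy)]
pub struct CostModel {
    /// Host-to-device bandwidth in bytes per second.
    pub pcie_bandwidth: f64,
    /// Peak GPU throughput in FLOP per second.
    pub gpu_flops: f64,
    /// Required ratio of compute time to transfer time.
    pub min_dispatch_ratio: f64,
}

impl Default for CostModel {
    fn default() -> Self {
        // Defaults quoted in the text: PCIe 4.0 x16, A100-class GPU, 5x rule.
        Self { pcie_bandwidth: 32e9, gpu_flops: 20e12, min_dispatch_ratio: 5.0 }
    }
}

impl CostModel {
    /// True when estimated compute time exceeds `min_dispatch_ratio` times the transfer time.
    pub fn gpu_pays_off(&self, data_bytes: usize, flops: u64) -> bool {
        let transfer_s = data_bytes as f64 / self.pcie_bandwidth;
        let compute_s = flops as f64 / self.gpu_flops;
        compute_s > self.min_dispatch_ratio * transfer_s
    }

    /// FLOP per transferred byte above which the inequality holds
    /// (obtained by rearranging compute_s > ratio * transfer_s).
    pub fn break_even_flops_per_byte(&self) -> f64 {
        self.min_dispatch_ratio * self.gpu_flops / self.pcie_bandwidth
    }
}
```

Under these assumptions, `CostModel::default().break_even_flops_per_byte()` evaluates to 3125, meaning a workload would need more than 3125 floating-point operations per byte moved before the inequality favors the GPU.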
When GPU Is Beneficial
| Operation | Data Size (elements) | Recommended Backend | Why |
|---|---|---|---|
| Element-wise add | Any | SIMD (never GPU) | Memory-bound, PCIe overhead dominates |
| Dot product | < 100K | SIMD | Transfer cost exceeds compute |
| Dot product | > 100K | GPU | Sufficient compute to amortize transfer |
| Matrix multiply | < 10K | SIMD | Small matrices fit in CPU cache; transfer cost exceeds compute |
| Matrix multiply | > 10K | GPU | O(n^3) compute dominates O(n^2) transfer |
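The scaling argument in the last column can be checked by counting work and traffic directly. The following sketch assumes `f32` elements, square `n x n` matrices, and that every operand and result crosses the PCIe bus exactly once; the function names are illustrative and are not Trueno APIs.

```rust
/// Bytes per element, assuming f32 data.
const ELEM_BYTES: u64 = 4;

/// Element-wise add of two length-`n` vectors: returns (flops, bytes moved).
/// One addition per element; two inputs and one output cross the bus.
pub fn elementwise_add_cost(n: u64) -> (u64, u64) {
    (n, 3 * n * ELEM_BYTES)
}

/// Square matrix multiply C = A * B with n x n operands: returns (flops, bytes moved).
/// Conventionally counted as n multiplies and n adds for each of the n^2 outputs;
/// three matrices cross the bus.
pub fn matmul_cost(n: u64) -> (u64, u64) {
    (2 * n * n * n, 3 * n * n * ELEM_BYTES)
}

/// Arithmetic intensity in FLOP per transferred byte.
pub fn intensity((flops, bytes): (u64, u64)) -> f64 {
    flops as f64 / bytes as f64
}
```

With this accounting, element-wise add stays at 1/12 FLOP per byte regardless of size, while matrix multiply grows as n/6, which is why only the latter can eventually amortize the transfer. The size thresholds in the table are the document's own; this sketch only illustrates the scaling behind them.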
Matrix Multiplication Example
let selector = BackendSelector::new();

// Small matrix: SIMD is faster
let backend = selector.select_for_matmul(64, 64, 64);
// --> Backend::SIMD

// Large matrix: GPU is faster
let backend = selector.select_for_matmul(1024, 1024, 1024);
// --> Backend::GPU
Customizing Thresholds
The selector can be configured for different hardware:
let selector = BackendSelector::new()
    .with_pcie_bandwidth(64e9)       // PCIe 5.0
    .with_gpu_gflops(40e12)          // RTX 4090
    .with_min_dispatch_ratio(3.0);   // More aggressive dispatch
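For readers curious how such a fluent configuration might be structured, here is a minimal builder sketch. `SelectorConfig` is a hypothetical type that only mirrors the method names used above; the real `BackendSelector` may store or validate these values differently.

```rust
/// Hypothetical configuration type mirroring the builder calls shown above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SelectorConfig {
    pcie_bandwidth: f64,     // bytes per second
    gpu_gflops: f64,         // FLOP per second, despite the name (values like 40e12)
    min_dispatch_ratio: f64, // compute/transfer ratio required for GPU dispatch
}

impl SelectorConfig {
    /// Starts from the defaults quoted earlier (32 GB/s, 20 TFLOPS, ratio 5).
    pub fn new() -> Self {
        Self { pcie_bandwidth: 32e9, gpu_gflops: 20e12, min_dispatch_ratio: 5.0 }
    }

    /// Each setter consumes and returns `self` so calls can be chained.
    pub fn with_pcie_bandwidth(mut self, bytes_per_s: f64) -> Self {
        self.pcie_bandwidth = bytes_per_s;
        self
    }

    pub fn with_gpu_gflops(mut self, flops_per_s: f64) -> Self {
        self.gpu_gflops = flops_per_s;
        self
    }

    pub fn with_min_dispatch_ratio(mut self, ratio: f64) -> Self {
        self.min_dispatch_ratio = ratio;
        self
    }

    /// FLOP per transferred byte needed before the GPU is chosen.
    pub fn break_even_flops_per_byte(&self) -> f64 {
        self.min_dispatch_ratio * self.gpu_gflops / self.pcie_bandwidth
    }
}
```

Applying the three overrides from the example to this sketch lowers the break-even intensity from 3125 to 1875 FLOP per byte, so smaller workloads would qualify for GPU dispatch.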
GPU Backends via wgpu
Trueno abstracts GPU compute through wgpu, which maps to the native GPU API on each platform:
| Platform | API |
|---|---|
| Linux | Vulkan |
| macOS | Metal |
| Windows | DX12 / Vulkan |
| Browser | WebGPU |
When to Avoid GPU
GPU dispatch should be avoided when:
- Data fits entirely in L1/L2 cache (SIMD will be faster)
- The operation is memory-bound (element-wise operations)
- The program will run in WASM without WebGPU support
- Latency matters more than throughput (kernel launch overhead is ~10 µs)
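
The checklist above can be folded into a single pre-check that runs before any cost model. This is a sketch only: the `Workload` fields, the 1 MiB cache budget, and the function name are assumptions chosen for illustration, and the launch overhead constant is the approximate figure quoted in the list.

```rust
/// Hypothetical description of a workload and its environment.
pub struct Workload {
    pub data_bytes: usize,
    /// True for element-wise style operations limited by memory traffic.
    pub memory_bound: bool,
    /// True when the program is compiled for WASM.
    pub is_wasm: bool,
    /// True when WebGPU is available to the WASM host.
    pub webgpu_available: bool,
    /// Maximum acceptable latency in microseconds, if the caller has one.
    pub latency_budget_us: Option<f64>,
}

/// Assumed cache budget (1 MiB) below which SIMD is preferred; tune per CPU.
const CACHE_BUDGET_BYTES: usize = 1 << 20;
/// Approximate kernel launch overhead in microseconds.
const LAUNCH_OVERHEAD_US: f64 = 10.0;

/// Returns true when any of the "avoid GPU" conditions applies.
pub fn should_avoid_gpu(w: &Workload) -> bool {
    let fits_in_cache = w.data_bytes <= CACHE_BUDGET_BYTES;
    let wasm_without_webgpu = w.is_wasm && !w.webgpu_available;
    // A budget at or below the launch overhead cannot be met by a GPU dispatch.
    let latency_critical = w
        .latency_budget_us
        .map_or(false, |budget| budget <= LAUNCH_OVERHEAD_US);
    fits_in_cache || w.memory_bound || wasm_without_webgpu || latency_critical
}
```

A caller would run this guard first and consult the transfer-versus-compute comparison only when it returns `false`.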