Cost-Based Backend Selection

The cost-based backend dispatcher is the brain of Trueno-DB. It automatically selects the optimal execution backend (GPU, SIMD, or Scalar) for each query using a physics-based cost model.

The Problem

Modern systems have multiple execution backends:

  • GPU: Massively parallel (1000s of cores), but high transfer overhead
  • SIMD: Medium parallelism (8-16 lanes), low overhead
  • Scalar: No parallelism, lowest overhead

Question: For a given query workload, which backend is fastest?

The 5x Rule

Trueno-DB uses a simple, physics-based rule:

Use GPU only if compute time > 5x transfer time

This rule comes from Genchi Genbutsu (Go and See): measuring real PCIe bandwidth and GPU performance.

Cost Model

Rule 1: Minimum Data Size (10 MB)

const MIN_GPU_DATA_SIZE_BYTES: usize = 10_000_000; // 10 MB

if total_bytes < MIN_GPU_DATA_SIZE_BYTES {
    return Backend::Simd; // Transfer overhead not worthwhile
}

Rationale: For small datasets (<10 MB), PCIe transfer overhead dominates. SIMD is faster.
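
For scale: at the modeled 32 GB/s, the 10 MB threshold itself corresponds to 10 MB / 32 GB/s ≈ 0.31 ms of one-way transfer before the GPU executes a single instruction.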

Rule 2: PCIe Transfer Time

const PCIE_BANDWIDTH_GBPS: f64 = 32.0; // PCIe Gen4 x16

let pcie_transfer_time_ms =
    (total_bytes as f64 / (PCIE_BANDWIDTH_GBPS * 1_000_000_000.0)) * 1000.0;

  • Measured on: AMD Ryzen 9 7950X with PCIe Gen4 x16
  • Peak bandwidth: 32 GB/s (64 GB/s bidirectional)
  • Effective bandwidth: ~28 GB/s (due to protocol overhead)
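
Worked example: a 1 GB column batch takes 1 GB / 32 GB/s = 31.25 ms to cross the bus one way, the figure used in the scenarios below.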

Rule 3: GPU Compute Time

const GPU_THROUGHPUT_GFLOPS: f64 = 100.0; // Conservative estimate

let estimated_gpu_compute_ms =
    (estimated_flops / (GPU_THROUGHPUT_GFLOPS * 1_000_000_000.0)) * 1000.0;

  • GPU model: AMD Radeon RX 7900 XTX
  • Peak performance: 61 TFLOP/s (FP32)
  • Sustained performance: ~100 GFLOP/s (conservative, accounts for memory bottlenecks)
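
For intuition, a minimal sketch of how estimated_flops might be formed for a simple aggregate; the 5-FLOPs-per-row figure is the same back-of-envelope constant used in Scenario 2 below, not a measured value:

// 5 aggregates (SUM + AVG + COUNT + MIN + MAX) over 250M int32 values
let num_rows: usize = 250_000_000;
let flops_per_row: f64 = 5.0;
let estimated_flops = num_rows as f64 * flops_per_row; // 1.25e9 FLOPs ≈ 12.5 ms at 100 GFLOP/s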

Rule 4: Apply 5x Rule

const TRANSFER_OVERHEAD_MULTIPLIER: f64 = 5.0;

if estimated_gpu_compute_ms > pcie_transfer_time_ms * TRANSFER_OVERHEAD_MULTIPLIER {
    Backend::Gpu
} else {
    Backend::Simd
}

Intuition: the estimated GPU compute time must exceed five times the transfer time, otherwise the PCIe round trip eats any speedup.
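
Substituting the constants makes the break-even point explicit: estimated_gpu_compute_ms > 5 * pcie_transfer_time_ms is equivalent to FLOPs / bytes > 5 * (100 GFLOP/s / 32 GB/s) ≈ 15.6, i.e. the kernel must perform roughly 16 floating-point operations per byte shipped across PCIe before the GPU wins.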

Example Scenarios

Scenario 1: Small Dataset (1 MB, Simple Sum)

Data: 1 MB (250,000 int32 values)
Operation: SUM (1 FLOP per value)
FLOPs: 250,000

Transfer time: 1 MB / 32 GB/s = 0.03 ms
Compute time: 250,000 / 100 GFLOP/s = 0.0025 ms

Result: 0.0025 ms < 0.03 ms * 5 = 0.15 ms
Backend: SIMD (transfer dominates; Rule 1 also rejects the GPU outright, since 1 MB < 10 MB)

Scenario 2: Large Dataset (1 GB, Complex Aggregation)

Data: 1 GB (250M int32 values)
Operation: SUM + AVG + COUNT + MIN + MAX (5 FLOPs per value)
FLOPs: 1.25 billion

Transfer time: 1 GB / 32 GB/s = 31.25 ms
Compute time: 1.25B / 100 GFLOP/s = 12.5 ms

Result: 12.5 ms < 31.25 ms * 5 = 156.25 ms
Backend: SIMD (still transfer-bound)

Scenario 3: Very Large Compute (1 GB, Hash Join)

Data: 1 GB (build + probe tables)
Operation: Hash join (200 FLOPs per probe)
FLOPs: 50 billion

Transfer time: 1 GB / 32 GB/s = 31.25 ms
Compute time: 50B / 100 GFLOP/s = 500 ms

Result: 500 ms > 31.25 ms * 5 = 156.25 ms
Backend: GPU (compute justifies transfer)

Implementation

See src/backend/mod.rs:

impl BackendDispatcher {
    #[must_use]
    pub fn select(total_bytes: usize, estimated_flops: f64) -> Backend {
        // Rule 1: Minimum data size threshold
        if total_bytes < MIN_GPU_DATA_SIZE_BYTES {
            return Backend::Simd;
        }

        // Rule 2: Calculate transfer time
        let pcie_transfer_time_ms =
            (total_bytes as f64 / (PCIE_BANDWIDTH_GBPS * 1_000_000_000.0)) * 1000.0;

        // Rule 3: Estimate GPU compute time
        let estimated_gpu_compute_ms =
            (estimated_flops / (GPU_THROUGHPUT_GFLOPS * 1_000_000_000.0)) * 1000.0;

        // Rule 4: Apply 5x rule
        if estimated_gpu_compute_ms > pcie_transfer_time_ms * TRANSFER_OVERHEAD_MULTIPLIER {
            Backend::Gpu
        } else {
            Backend::Simd
        }
    }
}
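
A minimal usage sketch (the batch sizing here is illustrative, not taken from the codebase):

// 1 GB of int32 values scanned with a 5-aggregate pass (SUM/AVG/COUNT/MIN/MAX)
let total_bytes: usize = 1_000_000_000;
let estimated_flops = (total_bytes / 4) as f64 * 5.0; // 1.25e9 FLOPs

match BackendDispatcher::select(total_bytes, estimated_flops) {
    Backend::Gpu => { /* dispatch the GPU kernel */ }
    Backend::Simd => { /* dispatch the vectorized CPU kernel */ }
    _ => { /* scalar fallback */ }
}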

Testing

See tests/backend_selection_test.rs:

#[test]
fn test_small_dataset_selects_cpu() {
    let total_bytes = 1_000_000; // 1 MB
    let estimated_flops = 1_000_000.0; // Simple sum
    let backend = BackendDispatcher::select(total_bytes, estimated_flops);
    assert!(matches!(backend, Backend::Simd));
}

#[test]
fn test_very_large_compute_selects_gpu() {
    let total_bytes = 1_000_000_000; // 1 GB
    let estimated_flops = 100_000_000_000.0; // 100 GFLOP
    let backend = BackendDispatcher::select(total_bytes, estimated_flops);
    assert!(matches!(backend, Backend::Gpu));
}
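
A further boundary case worth covering (a sketch, not part of the existing suite): a large but memory-bound workload that clears the 10 MB minimum yet should still stay on SIMD.

#[test]
fn test_large_transfer_bound_selects_simd() {
    let total_bytes = 1_000_000_000; // 1 GB, well above the 10 MB minimum
    let estimated_flops = 1_250_000_000.0; // ~5 FLOPs per int32 value (Scenario 2)
    let backend = BackendDispatcher::select(total_bytes, estimated_flops);
    // 12.5 ms of compute < 156.25 ms (5x the 31.25 ms transfer) => SIMD
    assert!(matches!(backend, Backend::Simd));
}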

Arithmetic Intensity

The cost model implicitly calculates arithmetic intensity:

AI = FLOPs / Bytes

Rule of thumb:

  • AI < 1: Memory-bound (use SIMD)
  • 1 ≤ AI < 10: Balanced (depends on dataset size)
  • AI ≥ 10: Compute-bound (use GPU)
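
As a rough cross-check against the scenarios above, a tiny helper (illustrative only, not part of the current API) makes the intensity explicit:

fn arithmetic_intensity(estimated_flops: f64, total_bytes: usize) -> f64 {
    estimated_flops / total_bytes as f64
}

// Scenario 2: 1.25e9 FLOPs / 1e9 bytes = 1.25 (balanced, stays on SIMD)
// Scenario 3: 5.0e10 FLOPs / 1e9 bytes = 50.0 (compute-bound, goes to GPU)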

Future Improvements

Dynamic Profiling

Instead of static constants, profile actual hardware:

// Measure PCIe bandwidth
let bandwidth = measure_pcie_bandwidth();

// Measure GPU throughput
let throughput = measure_gpu_throughput();
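
One possible shape for this, sketched under the assumption that the measured numbers simply replace the static constants (the measure_* helpers above are hypothetical and not yet implemented):

struct CalibratedDispatcher {
    pcie_bandwidth_gbps: f64,
    gpu_throughput_gflops: f64,
}

impl CalibratedDispatcher {
    fn select(&self, total_bytes: usize, estimated_flops: f64) -> Backend {
        // Same four rules as BackendDispatcher::select, but with measured numbers.
        if total_bytes < MIN_GPU_DATA_SIZE_BYTES {
            return Backend::Simd;
        }
        let transfer_ms = (total_bytes as f64 / (self.pcie_bandwidth_gbps * 1e9)) * 1000.0;
        let compute_ms = (estimated_flops / (self.gpu_throughput_gflops * 1e9)) * 1000.0;
        if compute_ms > transfer_ms * TRANSFER_OVERHEAD_MULTIPLIER {
            Backend::Gpu
        } else {
            Backend::Simd
        }
    }
}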

Query Optimizer Integration

Integrate with the query optimizer to estimate FLOPs from the query plan:

fn estimate_flops(query: &QueryPlan) -> f64 {
    match query {
        // Simple aggregate: one pass over the column, ~1 FLOP per row.
        QueryPlan::Sum(table) => table.num_rows() as f64,
        // Hash join: ~50 FLOPs per build row, ~200 FLOPs per probe row.
        QueryPlan::HashJoin(build, probe) => {
            build.num_rows() as f64 * 50.0 + probe.num_rows() as f64 * 200.0
        }
        // ...
    }
}
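
The estimate would then feed straight into the dispatcher; plan.total_input_bytes() is a hypothetical accessor for the plan's scanned byte count:

let estimated_flops = estimate_flops(&plan); // `plan` is a QueryPlan
let backend = BackendDispatcher::select(plan.total_input_bytes(), estimated_flops);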

Multi-GPU Support

Extend to support multiple GPUs:

enum Backend {
    Gpu { device_id: usize },
    Simd,
    Scalar,
}

Academic References

  • Breß et al. (2014): "Robust Query Processing in Co-Processor-accelerated Databases"
  • Gregg & Hazelwood (2011): "Where is the data? Why you cannot debate CPU vs. GPU performance without the answer"
  • He et al. (2008): "Relational joins on graphics processors"

Toyota Way Principles

  • Genchi Genbutsu: Measured real PCIe bandwidth (32 GB/s)
  • Muda: Eliminate waste by avoiding GPU when transfer dominates
  • Jidoka: Built-in quality through comprehensive testing

Next Steps

See the Backend Dispatcher section for implementation details of the cost-based backend selection algorithm.