GPU Syscall Tracing with Renacer

This chapter documents the syscall-level analysis of Trueno-DB's GPU operations using Renacer, a next-generation syscall tracer built at Practical AI with DWARF debug info integration.

Overview

Goal: Understand the low-level kernel interactions during GPU computation to validate our zero-copy design and identify potential bottlenecks.

Toyota Way Principle: Genchi Genbutsu (Go and See) - Direct observation of actual GPU operations at the syscall level proves our architecture claims.

Tracing Tool: Renacer v0.5.1 - A modern strace replacement with DWARF support, function timing, and statistical analysis.

Test Methodology

Building the Release Binary

The key insight is to build the test binary with --all-features to include GPU tests, then trace the pre-built binary directly:

# Step 1: Build test binary WITHOUT running tests
cargo test --release --all-features --lib --no-run

# Step 2: Find the binary (output shows path)
# Example: target/release/deps/trueno_db-e734efbf8b7a7c79

# Step 3: List available GPU tests
./target/release/deps/trueno_db-<hash> --list | grep "gpu::"

# Step 4: Trace specific test with renacer
renacer -c -T --exact -- \
  ./target/release/deps/trueno_db-<hash> \
  gpu::tests::test_gpu_sum_basic

###Why This Approach?

Previous Mistake: Tracing cargo test captured compilation (272s, 99.94% futex from rustc thread synchronization) instead of test execution.

Correct Approach: Pre-build the binary, then trace only the test execution. This isolates GPU operations from build overhead.

Test: gpu::tests::test_gpu_sum_basic Location: src/gpu/mod.rs:464 Dataset: 1,000 i32 values (4 KB) Operation: GPU SUM aggregation with parallel reduction Hardware: wgpu auto-detected adapter (Vulkan/Metal/DX12) Build: Release mode (--release) Total Time: 430ms (0.43s)

Syscall Analysis

Top Syscalls by Time

Syscall	Time (s)	% Time	Calls	Avg (µs)	Purpose
futex	0.426	98.46%	3	142,055	Thread synchronization (tokio runtime)
ioctl	0.003	0.74%	14	227	GPU device control (wgpu operations)
munmap	0.002	0.48%	17	121	Memory unmapping (cleanup)
close	0.0003	0.07%	21	14	File descriptor cleanup
openat	0.0002	0.05%	22	9	File/device opening (12 failures)
mmap	0.0002	0.04%	18	10	Memory mapping (Arrow buffers)
read	0.0002	0.04%	18	8	File/device reads
Other	0.0003	0.07%	77	-	statx, write, mprotect, etc.

Total: 190 syscalls in 432.8ms (2,278 µs average per call)

Key Findings

1. GPU Operations Are Async (98.46% Futex Time)

Observation: 98.46% of wall-clock time spent in futex (3 calls averaging 142ms each).

Explanation: Our GPU test is async and runs on a tokio runtime:

#[tokio::test]
async fn test_gpu_sum_basic() {
    let engine = GpuEngine::new().await.unwrap();
    // ... GPU operations ...
}

The futex calls represent the async runtime waiting for:

GPU adapter initialization (GpuEngine::new())
Buffer allocation and transfer to VRAM
Shader execution and result retrieval

Validation: This proves our async design works - the CPU yields during GPU operations rather than blocking.

2. GPU Device Control via ioctl (14 calls, 3.2ms)

Observation: 14 ioctl syscalls totaling 3.2ms (0.74% of time).

Explanation: ioctl (input/output control) is the Linux kernel's interface to GPU drivers. These calls represent:

ioctl Purpose	Estimated Calls	Operations
Device initialization	~4	Adapter query, device creation, queue setup
Buffer operations	~4	VRAM allocation, CPU→GPU transfer
Shader execution	~2	Dispatch compute shader, synchronization
Result retrieval	~2	GPU→CPU transfer, buffer reads
Cleanup	~2	Resource deallocation

Validation: 14 ioctl calls for a simple SUM operation is reasonable. Production workloads will amortize this overhead across larger datasets.

3. Memory Operations (35 calls, 2.4ms)

Breakdown:

mmap (18 calls, 0.18ms): Memory mapping for Arrow buffers and GPU shared memory
munmap (17 calls, 2.06ms): Cleanup during Drop impls
mprotect (8 calls, 0.07ms): Memory protection changes

Zero-Copy Validation: Only 18 mmap calls for a test involving:

Arrow array creation (1 mmap)
GPU buffer allocation (1 mmap for staging buffer)
Result buffer (1 mmap)

Interpretation: Our zero-copy design is working - we're not seeing excessive memory copies via read/write syscalls.

4. File Operations (68 calls, 0.59ms)

Breakdown:

openat (22 calls, 12 failures): Probing GPU devices, loading shared libraries
read (18 calls): Reading GPU device metadata, shader SPIR-V
close (21 calls): Cleanup
statx (13 calls, 4 failures): File existence checks (shader cache, libraries)

Interpretation: These are mostly one-time initialization costs. Production usage with shader caching will reduce this overhead.

Comparison: GPU Test vs Compilation

Previous Mistake Analysis

When we traced cargo test gpu::tests::test_gpu_sum_basic (without pre-building), we captured rustc compilation:

Metric	GPU Test (Correct)	Compilation (Mistake)	Ratio
Total Time	0.43s	272.36s	633x slower
Futex %	98.46% (async wait)	99.94% (rustc threads)	Similar %
Futex Time	0.426s	272s	638x slower
Syscall Count	190	16,482	87x more
ioctl Calls	14	0	N/A (no GPU)

Lesson: Always pre-build with --no-run, then trace the binary directly.

Production Implications

Syscall Overhead at Scale

For a 1M row dataset (4 MB), assuming syscall counts scale linearly:

Operation	Per Test (4 KB)	Per 1M Rows (4 MB)	Amortization
ioctl	14 calls (3.2ms)	~14 calls (3.2ms)	1000x better per KB
mmap	18 calls (0.18ms)	~20 calls (0.2ms)	1000x better per KB
Total Syscalls	190 (0.43s)	~200 (varies)	Fixed cost amortized

Key Insight: GPU syscall overhead is mostly fixed cost (device init, shader compilation). Larger datasets pay the same overhead, making GPU increasingly efficient.

Zero-Copy Validation

Metric: Only 0.18ms spent in mmap (18 calls) for buffer setup.

Comparison: If we were copying data via read/write, we'd expect:

1M i32 values = 4 MB
At 1 GB/s syscall throughput: ~4ms just for copy
Actual GPU compute: <1ms (50-100x CPU)
Copy would dominate (4ms copy vs 1ms compute)

Validation: Our zero-copy design (Arrow → wgpu shared buffers → GPU VRAM) avoids this overhead.

Reproducing These Results

Prerequisites

# Install renacer
cargo install renacer

# Verify installation
renacer --version  # Should show v0.5.1 or later

Step-by-Step Tracing

# 1. Build test binary with GPU feature
cargo test --release --all-features --lib --no-run

# 2. Find binary path (look for "Executable" line)
# Example output: target/release/deps/trueno_db-e734efbf8b7a7c79

# 3. List available GPU tests
./target/release/deps/trueno_db-<hash> --list | grep "gpu::"

# 4. Trace with syscall summary (-c) and time statistics (-T)
renacer -c -T --exact -- \
  ./target/release/deps/trueno_db-<hash> \
  gpu::tests::test_gpu_sum_basic | tee gpu_trace.txt

# 5. (Optional) Trace with DWARF source locations
renacer -c --source -- \
  ./target/release/deps/trueno_db-<hash> \
  gpu::tests::test_gpu_sum_basic | tee gpu_trace_source.txt

Expected Output

running 1 test
test gpu::tests::test_gpu_sum_basic ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 67 filtered out; finished in 0.43s

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.46    0.426166      142055         3           futex
  0.74    0.003191         227        14         1 ioctl
  0.48    0.002064         121        17           munmap
  0.07    0.000284          13        21           close
  0.05    0.000208           9        22        12 openat
  0.04    0.000176           9        18           mmap
  0.04    0.000152           8        18           read
  ...
------ ----------- ----------- --------- --------- ----------------
100.00    0.432843        2278       190        19 total

Troubleshooting

Issue: "No such file or directory" when tracing

Cause: Binary hash changes after recompilation.

Solution: Re-run step 2 to find the current binary path after each cargo test --no-run.

Issue: "0 tests run" or "filtered out"

Cause: Test name doesn't match exactly, or GPU feature not enabled.

Solution:

# Use --exact flag
./target/release/deps/trueno_db-<hash> --exact gpu::tests::test_gpu_sum_basic

# Verify GPU feature is enabled
cargo test --release --all-features --lib --no-run

Benchmarking Methodology - General benchmark practices
Backend Comparison - GPU vs SIMD vs Scalar theory
PCIe Transfer Analysis - Bandwidth measurements and 5x rule
Competitive Benchmarks - vs DuckDB, SQLite

Next Steps

Run trace: renacer -c -T -- ./target/release/deps/trueno_db-<hash> --exact gpu::tests::test_gpu_sum_basic
Analyze results: Compare your syscall counts to the table above
Scale up: Trace larger datasets (100K, 1M rows) to see syscall amortization
Optimize: If ioctl count is unexpectedly high, investigate shader caching

Feedback

Found an issue with the GPU tracing? Report at: GitHub Issues: paiml/trueno-db/issues

Trueno-DB - GPU-Accelerated Database with EXTREME TDD