GPU Monitoring

This chapter covers trueno's GPU monitoring capabilities as defined in TRUENO-SPEC-010.

Overview

Trueno provides comprehensive GPU monitoring through two complementary approaches:

  1. Cross-platform wgpu backend - Works on any system with Vulkan, Metal, or DX12
  2. Native CUDA backend - Direct access to NVIDIA GPU information via CUDA Driver API

Quick Start

use trueno::monitor::{GpuMonitor, GpuDeviceInfo, MonitorConfig};

// Enumerate all available GPUs
let devices = GpuDeviceInfo::enumerate()?;
for dev in &devices {
    println!("[{}] {} ({:.2} GB)", dev.index, dev.name, dev.vram_gb());
}

// Create a monitor with history buffer
let monitor = GpuMonitor::new(0, MonitorConfig::default())?;

// Collect metrics over time
for _ in 0..10 {
    let metrics = monitor.collect()?;
    println!("Memory: {:.1}% used", metrics.memory.usage_percent());
}

Feature Flags

FeatureDescription
gpuEnable wgpu-based GPU monitoring (cross-platform)
cuda-monitorEnable native CUDA monitoring (NVIDIA only)

Enable features in your Cargo.toml:

[dependencies]
trueno = { version = "0.8", features = ["gpu", "cuda-monitor"] }

Device Discovery

GpuDeviceInfo

Represents a discovered GPU device:

pub struct GpuDeviceInfo {
    pub index: usize,
    pub name: String,
    pub vendor: GpuVendor,
    pub backend: GpuBackend,
    pub vram_total: u64,
    pub compute_capability: Option<(u32, u32)>,
    pub driver_version: Option<String>,
}

Methods:

  • enumerate() -> Result<Vec<GpuDeviceInfo>, MonitorError> - List all GPUs
  • vram_gb() -> f64 - Get VRAM in gigabytes
  • supports_cuda() -> bool - Check CUDA support

GpuVendor

GPU manufacturer identification:

pub enum GpuVendor {
    Nvidia,
    Amd,
    Intel,
    Apple,
    Unknown(u32),
}

PCI Vendor ID Mapping:

Vendor IDVendor
0x10deNVIDIA
0x1002AMD
0x8086Intel
0x106bApple

GpuBackend

Graphics/compute backend:

pub enum GpuBackend {
    Vulkan,
    Metal,
    Dx12,
    Cuda,
    WebGpu,
    OpenGl,
    Cpu,
}

Memory Monitoring

GpuMemoryMetrics

Real-time memory statistics:

pub struct GpuMemoryMetrics {
    pub total: u64,      // Total VRAM in bytes
    pub used: u64,       // Used VRAM in bytes
    pub free: u64,       // Free VRAM in bytes
}

Methods:

  • usage_percent() -> f64 - Memory utilization (0.0-100.0)
  • available_gb() -> f64 - Free memory in GB

GpuMonitor

The GpuMonitor provides continuous monitoring with a ring buffer for history:

// Configure monitoring
let config = MonitorConfig {
    poll_interval: Duration::from_millis(100),
    history_size: 1000,
};

// Create monitor for device 0
let monitor = GpuMonitor::new(0, config)?;

// Collect a sample
let metrics = monitor.collect()?;

// Get sample age
println!("Sample age: {:?}", metrics.age());

// Check history
println!("History size: {}", monitor.sample_count());

MonitorConfig

pub struct MonitorConfig {
    pub poll_interval: Duration,  // Default: 100ms
    pub history_size: usize,      // Default: 1000
}

GpuMetrics

Complete metrics snapshot:

pub struct GpuMetrics {
    pub memory: GpuMemoryMetrics,
    pub utilization: GpuUtilization,
    pub thermal: GpuThermalMetrics,
    pub power: GpuPowerMetrics,
    pub clock: GpuClockMetrics,
    pub pcie: GpuPcieMetrics,
    pub timestamp: Instant,
}

CUDA Native Monitoring

For NVIDIA GPUs, enable cuda-monitor for accurate device information via the CUDA Driver API:

use trueno::monitor::{
    cuda_monitor_available,
    enumerate_cuda_devices,
    query_cuda_memory,
};

// Check availability
if cuda_monitor_available() {
    // Enumerate CUDA devices
    let devices = enumerate_cuda_devices()?;

    // Query real-time memory
    let mem = query_cuda_memory(0)?;
    println!("CUDA Memory: {:.1}% used", mem.usage_percent());
}

Why CUDA Native?

AspectwgpuCUDA Native
Device NameGeneric ("NVIDIA GPU")Exact ("GeForce RTX 4090")
Memory InfoEstimatedAccurate (cuMemGetInfo)
PortabilityCross-platformNVIDIA only
Dependencieswgpulibcuda.so/nvcuda.dll

trueno-gpu Module

For direct CUDA access without the trueno facade:

use trueno_gpu::monitor::{CudaDeviceInfo, CudaMemoryInfo};
use trueno_gpu::driver::CudaContext;

// Query device info
let info = CudaDeviceInfo::query(0)?;
println!("GPU: {} ({:.2} GB)", info.name, info.total_memory_gb());

// Create context and query memory
let ctx = CudaContext::new(0)?;
let mem = CudaMemoryInfo::query(&ctx)?;
println!("Memory: {}", mem);  // "8192 / 24576 MB (33.3% used)"

Examples

Run the GPU Monitor Demo

# Cross-platform (wgpu)
cargo run --example gpu_monitor_demo --features gpu

# With CUDA (NVIDIA)
cargo run --example gpu_monitor_demo --features "gpu,cuda-monitor"

Run the CUDA Monitor Example

cargo run -p trueno-gpu --example cuda_monitor --features cuda

Error Handling

pub enum MonitorError {
    NoDevice,           // No GPU found
    DeviceNotFound(u32), // Specific device not found
    BackendError(String), // Backend-specific error
    ContextError(String), // Context creation failed
}

Performance Considerations

  • Poll Interval: Set poll_interval based on your monitoring needs. 100ms is good for visualization; 1s is sufficient for logging.
  • History Size: The ring buffer is fixed-size. Larger sizes consume more memory but allow longer history analysis.
  • CUDA Context: Creating a CUDA context has overhead. Reuse GpuMonitor instances when possible.

References

  • TRUENO-SPEC-010: GPU Monitoring, Tracing, and Visualization
  • Nickolls et al. (2008): GPU parallel computing model
  • CUDA Driver API: cuDeviceGetName, cuDeviceTotalMem, cuMemGetInfo