PTX Bug Detection

The trueno-explain crate provides static analysis for PTX (NVIDIA GPU assembly) to detect common bugs and performance issues before runtime.

Overview

Hand-written PTX is error-prone. The PTX bug analyzer catches:

Severity	Bug Class	Description
P0 Critical	`SHARED_MEM_U64`	64-bit addressing for shared memory (undefined behavior)
P0 Critical	`MISSING_BARRIER`	Missing `bar.sync` between shared memory operations
P0 Critical	`LOOP_BRANCH_END`	Unconditional branch to loop end (infinite loop)
P1 High	`HIGH_REG_PRESSURE`	>64 registers per thread (reduces occupancy)
P1 High	`PRED_OVERFLOW`	>8 predicates (causes spills)
P1 High	`PLACEHOLDER_CODE`	Incomplete code ("simplified", "omitted" comments)
P1 High	`EMPTY_LOOP`	Loop without computation
P1 High	`NO_BOUNDS_CHECK`	Missing thread bounds check
P1 High	`REG_SPILLS`	`.local` memory usage (register spills)
P2 Medium	`DEAD_CODE`	Unreachable code after `ret`/`bra`
P2 Medium	`UNOPT_MEM`	Non-vectorized memory access
P2 Medium	`REDUNDANT_MOVES`	Redundant register moves

Quick Start

use trueno_explain::{PtxBugAnalyzer, BugSeverity};

// Analyze PTX string
let ptx = include_str!("kernel.ptx");
let result = PtxBugAnalyzer::new().analyze(ptx);

// Check for bugs
if result.has_bugs() {
    println!("{}", result.format_report());
}

// Check specific severity
let critical = result.count_by_severity(BugSeverity::Critical);
assert_eq!(critical, 0, "No P0 bugs allowed!");

Analyzer Modes

Default Mode

Standard analysis - catches obvious bugs:

let analyzer = PtxBugAnalyzer::new();
let result = analyzer.analyze(ptx);

Strict Mode

Catches more potential issues (may have false positives):

let analyzer = PtxBugAnalyzer::strict();
let result = analyzer.analyze(ptx);

With Whitelist

Suppress known acceptable warnings:

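use trueno_explain::{PtxBugAnalyzer, PtxBugClass};

// Suppress HighRegisterPressure for every kernel whose name matches the pattern
let analyzer = PtxBugAnalyzer::new().with_whitelist(
    "tensor_core*",
    PtxBugClass::HighRegisterPressure,
    "Tensor core kernels need high registers",
);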
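The exact matching rules for the kernel pattern are not spelled out here. As a rough mental model only, the sketch below assumes that a trailing `*` matches any suffix and that a pattern without `*` must match the kernel name exactly. The function `pattern_matches` and the kernel names in the usage note are hypothetical and are not part of the trueno-explain API.

```rust
/// Hypothetical matcher, NOT the crate's implementation.
/// Assumption: a trailing `*` matches any suffix; otherwise the match is exact.
fn pattern_matches(pattern: &str, kernel_name: &str) -> bool {
    match pattern.strip_suffix('*') {
        // "tensor_core*" -> every name that starts with "tensor_core"
        Some(prefix) => kernel_name.starts_with(prefix),
        // No wildcard: the whole name has to be identical
        None => pattern == kernel_name,
    }
}
```

Under that assumption, `pattern_matches("tensor_core*", "tensor_core_gemm")` is true, while `pattern_matches("q4k*", "gemm_tiled")` is false.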

Quantized Kernel Whitelist

Pre-configured for quantized kernels (q4k, q5k, q6k, q8k):

// Suppresses HighRegisterPressure for quantized kernels
let analyzer = PtxBugAnalyzer::with_quantized_whitelist();

Examples

Run Deep Bug Hunt

Analyze all trueno-gpu kernels:

cargo run -p trueno-explain --example deep_bug_hunt

Output:

SUMMARY: 30 kernels analyzed
  Total bugs: 16
  P0 Critical: 0
  P1 High: 16
  P2 Medium: 0

BUGS BY CLASS:
  HIGH_REG_PRESSURE         : 16

Analyze External PTX

Analyze hand-rolled PTX from another project:

cargo run -p trueno-explain --example analyze_realizar

Output:

REALIZAR PTX SUMMARY
  Files analyzed: 4
  Total bugs: 18
  P0 Critical: 0
  P1 High: 15
  P2 Medium: 3

Inspect PTX Details

Deep dive into a specific kernel's PTX:

cargo run -p trueno-explain --example ptx_inspector

Bug Classes in Detail

P0 Critical - Correctness Bugs

SharedMemU64Addressing

Problem: Using 64-bit registers for shared memory addressing.

// BAD: %rd0 is 64-bit
st.shared.f32 [%rd0], %f0;

// GOOD: %r0 is 32-bit
st.shared.f32 [%r0], %f0;

Impact: Undefined behavior, potential silent corruption.
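To make the check concrete, here is a minimal line-based sketch of how such a pattern could be spotted. It assumes the naming convention used in the example above (`%rd` for 64-bit registers, `%r` for 32-bit) and ignores predicated instructions. The function `find_shared_u64_addressing` is hypothetical and does not reflect how trueno-explain implements the check.

```rust
/// Returns the 1-based line numbers of `ld.shared` / `st.shared` instructions
/// whose address operand starts with `%rd` (assumed to be a 64-bit register).
/// Illustrative only: a real analyzer would resolve register declarations.
fn find_shared_u64_addressing(ptx: &str) -> Vec<usize> {
    let mut hits = Vec::new();
    for (idx, raw) in ptx.lines().enumerate() {
        // Drop any trailing `//` comment before inspecting the instruction
        let code = raw.split("//").next().unwrap_or("").trim();
        if !(code.starts_with("ld.shared") || code.starts_with("st.shared")) {
            continue;
        }
        // The address operand is the text between '[' and ']'
        if let (Some(open), Some(close)) = (code.find('['), code.find(']')) {
            if open < close && code[open + 1..close].trim_start().starts_with("%rd") {
                hits.push(idx + 1);
            }
        }
    }
    hits
}
```

Run on the two stores shown above, the sketch flags only the one that addresses shared memory through `%rd0`.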

MissingBarrierSync

Problem: No bar.sync between shared memory write and read.

// BAD: Race condition!
st.shared.f32 [%r0], %f0;
ld.shared.f32 %f1, [%r1];  // May read stale data

// GOOD: Barrier ensures visibility
st.shared.f32 [%r0], %f0;
bar.sync 0;
ld.shared.f32 %f1, [%r1];

Impact: Race condition, non-deterministic results.
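The rule can be approximated with a single pass that remembers whether a shared-memory store is still waiting for a barrier. The sketch below is a simplification under stated assumptions: it scans lines in textual order, ignores control flow and predication, and treats any `bar.sync` as clearing all pending stores. The name `find_unsynced_shared_reads` is hypothetical.

```rust
/// Returns the 1-based line numbers of `ld.shared` instructions that follow a
/// `st.shared` with no `bar.sync` in between (straight-line scan only).
fn find_unsynced_shared_reads(ptx: &str) -> Vec<usize> {
    let mut hits = Vec::new();
    let mut pending_store = false; // true while a store awaits a barrier
    for (idx, raw) in ptx.lines().enumerate() {
        let code = raw.split("//").next().unwrap_or("").trim();
        if code.starts_with("bar.sync") {
            pending_store = false;
        } else if code.starts_with("st.shared") {
            pending_store = true;
        } else if code.starts_with("ld.shared") && pending_store {
            hits.push(idx + 1);
        }
    }
    hits
}
```

A scan like this cannot tell whether the load and the store touch the same address from different threads, so it is conservative by design. That is one reason a whitelist mechanism is useful.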

P1 High - Performance Bugs

HighRegisterPressure

Problem: >64 registers per thread reduces occupancy.

Register count: 120
Max occupancy: 65536 / (120 * 32) = 17 warps/SM (53% of a 32-warp maximum)

Impact: Reduced parallelism, lower throughput.

Fix: Reduce live variables, split kernel, or accept lower occupancy for compute-bound kernels.
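The occupancy arithmetic above can be reproduced with a few lines of integer math. The sketch assumes a register file of 65,536 registers per SM and a limit of 32 resident warps per SM, which are the values implied by the example figures. Both limits vary by GPU architecture, and real hardware also rounds register allocation to a granularity that this ignores. The function names are illustrative.

```rust
/// Threads per warp on NVIDIA GPUs.
const WARP_SIZE: u32 = 32;

/// Resident warps per SM when only the register file is the limiting factor.
/// `regfile_size` and `max_warps_per_sm` are architecture-dependent inputs.
fn register_limited_warps(regs_per_thread: u32, regfile_size: u32, max_warps_per_sm: u32) -> u32 {
    if regs_per_thread == 0 {
        return max_warps_per_sm;
    }
    let by_registers = regfile_size / (regs_per_thread * WARP_SIZE);
    by_registers.min(max_warps_per_sm)
}

/// Occupancy as a whole-number percentage of the per-SM warp limit.
fn occupancy_percent(resident_warps: u32, max_warps_per_sm: u32) -> u32 {
    resident_warps * 100 / max_warps_per_sm
}
```

With these assumptions, 120 registers per thread gives 17 resident warps (53%), while 64 registers per thread gives 32 warps (100%). This is consistent with 64 being the point above which occupancy starts to drop.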

PlaceholderCode

Problem: Comments indicate incomplete implementation.

// Detected patterns:
// "simplified"
// "omitted"
// "placeholder"
// "for now"
// "TODO"

Impact: Kernel may produce incorrect results or have missing functionality.
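A check of this kind amounts to a keyword search restricted to comments. The sketch below assumes case-insensitive matching and only looks at `//` line comments. Whether trueno-explain matches case-insensitively or also inspects `/* */` comments is not stated here, and `find_placeholder_comments` is a hypothetical name.

```rust
/// Marker phrases taken from the list above, stored in lowercase.
static PLACEHOLDER_PATTERNS: [&str; 5] =
    ["simplified", "omitted", "placeholder", "for now", "todo"];

/// Returns (1-based line number, matched phrase) for each `//` comment that
/// contains a marker phrase. Assumption: matching is case-insensitive.
fn find_placeholder_comments(ptx: &str) -> Vec<(usize, &'static str)> {
    let mut hits = Vec::new();
    for (idx, line) in ptx.lines().enumerate() {
        // Only the text after `//` is searched, so instruction text is ignored
        if let Some(pos) = line.find("//") {
            let comment = line[pos + 2..].to_lowercase();
            for pattern in PLACEHOLDER_PATTERNS.iter() {
                if comment.contains(*pattern) {
                    hits.push((idx + 1, *pattern));
                    break; // report at most one phrase per line
                }
            }
        }
    }
    hits
}
```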

P2 Medium - Optimization Opportunities

DeadCode

Problem: Unreachable code after unconditional branch/return.

// BAD: add.f32 is unreachable
ret;
add.f32 %f0, %f1, %f2;

// BAD: mul.f32 is unreachable
bra skip;
mul.f32 %f0, %f1, %f2;
skip:

Impact: Code bloat, wasted compilation time.
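One way to picture the rule: after an unpredicated `ret` or `bra`, every instruction is unreachable until the next label, because only a label can be a branch target. The sketch below encodes exactly that and nothing more. It assumes one instruction per line and that predicated instructions start with `@`. The name `find_dead_code` is hypothetical.

```rust
/// Returns the 1-based line numbers of instructions that follow an
/// unconditional `ret` / `bra` before the next label.
fn find_dead_code(ptx: &str) -> Vec<usize> {
    let mut hits = Vec::new();
    let mut unreachable = false;
    for (idx, raw) in ptx.lines().enumerate() {
        let code = raw.split("//").next().unwrap_or("").trim();
        if code.is_empty() {
            continue;
        }
        // A label or the end of the function body makes code reachable again
        if code.ends_with(':') || code == "}" {
            unreachable = false;
            continue;
        }
        if unreachable {
            hits.push(idx + 1);
            continue;
        }
        // Predicated forms such as `@%p0 bra skip;` start with '@' and are skipped
        if code == "ret;" || code.starts_with("bra ") || code.starts_with("bra.uni ") {
            unreachable = true;
        }
    }
    hits
}
```

On both snippets above the sketch reports the second line. A predicated branch such as `@%p0 bra skip;` produces no report, since the fall-through path remains reachable.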

UnoptimizedMemoryPattern

Problem: Multiple single-element loads that could be vectorized.

// BAD: 4 separate loads
ld.global.f32 %f0, [%rd0];
ld.global.f32 %f1, [%rd0+4];
ld.global.f32 %f2, [%rd0+8];
ld.global.f32 %f3, [%rd0+12];

// GOOD: Single vectorized load
ld.global.v4.f32 {%f0, %f1, %f2, %f3}, [%rd0];

Impact: Four load instructions are issued where one would suffice, which can lower effective memory throughput.
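Detecting this pattern comes down to finding four consecutive scalar loads that share a base register and step through offsets 0, 4, 8, 12. The sketch below does that for `ld.global.f32` only. It is an illustration under simplifying assumptions (adjacent lines, `[base+offset]` syntax, no alias analysis), and the function names are hypothetical. A `v4.f32` load also requires the address to be 16-byte aligned, which a textual scan cannot verify.

```rust
/// Parses `ld.global.f32 <dst>, [<base>+<offset>];` into (base, byte offset).
/// A missing `+<offset>` is treated as offset 0.
fn parse_scalar_f32_load(line: &str) -> Option<(String, u32)> {
    let code = line.split("//").next().unwrap_or("").trim();
    if !code.starts_with("ld.global.f32") {
        return None;
    }
    let open = code.find('[')?;
    let close = code.find(']')?;
    if close <= open {
        return None;
    }
    let addr = &code[open + 1..close];
    match addr.split_once('+') {
        Some((base, off)) => Some((base.trim().to_string(), off.trim().parse().ok()?)),
        None => Some((addr.trim().to_string(), 0)),
    }
}

/// Returns the 1-based line number where each run of four adjacent f32 loads
/// with the same base and offsets n, n+4, n+8, n+12 begins.
fn find_vectorizable_loads(ptx: &str) -> Vec<usize> {
    let loads: Vec<Option<(String, u32)>> = ptx.lines().map(parse_scalar_f32_load).collect();
    let mut hits = Vec::new();
    let mut i = 0;
    while i + 4 <= loads.len() {
        let window = &loads[i..i + 4];
        let is_run = match &window[0] {
            Some((base, start)) => window.iter().enumerate().all(|(k, load)| {
                matches!(load, Some((b, off)) if b == base && *off == *start + 4 * k as u32)
            }),
            None => false,
        };
        if is_run {
            hits.push(i + 1);
            i += 4; // skip past the run that was just reported
        } else {
            i += 1;
        }
    }
    hits
}
```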

Integration with CI

Add PTX bug detection to your CI pipeline:

# .github/workflows/ptx-analysis.yml
- name: PTX Bug Analysis
  run: |
    cargo run -p trueno-explain --example deep_bug_hunt
    # Fail if any P0 bugs found
    cargo test -p trueno-explain --test ptx_bug_hunting

Writing Bug-Free PTX

Use trueno-gpu kernel generators instead of hand-writing PTX:

use trueno_explain::PtxBugAnalyzer;
use trueno_gpu::kernels::{GemmKernel, Kernel};

// Generated PTX is verified bug-free
let kernel = GemmKernel::tiled(1024, 1024, 1024, 32);
let ptx = kernel.emit_ptx();

// Verify with analyzer
let result = PtxBugAnalyzer::new().analyze(&ptx);
assert!(result.is_valid());

API Reference

PtxBugAnalyzer

impl PtxBugAnalyzer {
    /// Create default analyzer
    pub fn new() -> Self;

    /// Create strict mode analyzer
    pub fn strict() -> Self;

    /// Pre-configured whitelist for quantized kernels
    pub fn with_quantized_whitelist() -> Self;

    /// Add whitelist entry
    pub fn with_whitelist(
        self,
        kernel_pattern: &str,  // e.g., "q4k*"
        bug_class: PtxBugClass,
        reason: &str
    ) -> Self;

    /// Analyze PTX and return report
    pub fn analyze(&self, ptx: &str) -> PtxBugReport;
}

PtxBugReport

impl PtxBugReport {
    /// Check if any bugs found
    pub fn has_bugs(&self) -> bool;

    /// Check for specific bug class
    pub fn has_bug(&self, class: &PtxBugClass) -> bool;

    /// Check if kernel is valid (no P0/P1 bugs)
    pub fn is_valid(&self) -> bool;

    /// Count bugs by severity
    pub fn count_by_severity(&self, severity: BugSeverity) -> usize;

    /// Get formatted report string
    pub fn format_report(&self) -> String;
}
