PTX Bug Detection
The trueno-explain crate provides static analysis for PTX (NVIDIA GPU assembly) to detect common bugs and performance issues before runtime.
Overview
Hand-written PTX is error-prone. The PTX bug analyzer catches:
| Severity | Bug Class | Description |
|---|---|---|
| P0 Critical | SHARED_MEM_U64 | 64-bit addressing for shared memory (undefined behavior) |
| P0 Critical | MISSING_BARRIER | Missing bar.sync between shared memory operations |
| P0 Critical | LOOP_BRANCH_END | Unconditional branch to loop end (infinite loop) |
| P1 High | HIGH_REG_PRESSURE | >64 registers per thread (reduces occupancy) |
| P1 High | PRED_OVERFLOW | >8 predicates (causes spills) |
| P1 High | PLACEHOLDER_CODE | Incomplete code ("simplified", "omitted" comments) |
| P1 High | EMPTY_LOOP | Loop without computation |
| P1 High | NO_BOUNDS_CHECK | Missing thread bounds check |
| P1 High | REG_SPILLS | .local memory usage (register spills) |
| P2 Medium | DEAD_CODE | Unreachable code after ret/bra |
| P2 Medium | UNOPT_MEM | Non-vectorized memory access |
| P2 Medium | REDUNDANT_MOVES | Redundant register moves |
Quick Start
use trueno_explain::{PtxBugAnalyzer, BugSeverity};
// Analyze PTX string
let ptx = include_str!("kernel.ptx");
let result = PtxBugAnalyzer::new().analyze(ptx);
// Check for bugs
if result.has_bugs() {
println!("{}", result.format_report());
}
// Check specific severity
let critical = result.count_by_severity(BugSeverity::Critical);
assert_eq!(critical, 0, "No P0 bugs allowed!");
Analyzer Modes
Default Mode
Standard analysis - catches obvious bugs:
let analyzer = PtxBugAnalyzer::new();
let result = analyzer.analyze(ptx);
Strict Mode
Catches more potential issues (may have false positives):
let analyzer = PtxBugAnalyzer::strict();
let result = analyzer.analyze(ptx);
With Whitelist
Suppress known acceptable warnings:
use trueno_explain::PtxBugClass;
let analyzer = PtxBugAnalyzer::new()
.with_whitelist("tensor_core*", PtxBugClass::HighRegisterPressure,
"Tensor core kernels need high registers");
Quantized Kernel Whitelist
Pre-configured for quantized kernels (q4k, q5k, q6k, q8k):
// Suppresses HighRegisterPressure for quantized kernels
let analyzer = PtxBugAnalyzer::with_quantized_whitelist();
Examples
Run Deep Bug Hunt
Analyze all trueno-gpu kernels:
cargo run -p trueno-explain --example deep_bug_hunt
Output:
SUMMARY: 30 kernels analyzed
Total bugs: 16
P0 Critical: 0
P1 High: 16
P2 Medium: 0
BUGS BY CLASS:
HIGH_REG_PRESSURE : 16
Analyze External PTX
Analyze hand-rolled PTX from another project:
cargo run -p trueno-explain --example analyze_realizar
Output:
REALIZAR PTX SUMMARY
Files analyzed: 4
Total bugs: 18
P0 Critical: 0
P1 High: 15
P2 Medium: 3
Inspect PTX Details
Deep dive into specific kernel PTX:
cargo run -p trueno-explain --example ptx_inspector
Bug Classes in Detail
P0 Critical - Correctness Bugs
SharedMemU64Addressing
Problem: Using 64-bit registers for shared memory addressing.
// BAD: %rd0 is 64-bit
st.shared.f32 [%rd0], %f0;
// GOOD: %r0 is 32-bit
st.shared.f32 [%r0], %f0;
Impact: Undefined behavior, potential silent corruption.
MissingBarrierSync
Problem: No bar.sync between shared memory write and read.
// BAD: Race condition!
st.shared.f32 [%r0], %f0;
ld.shared.f32 %f1, [%r1]; // May read stale data
// GOOD: Barrier ensures visibility
st.shared.f32 [%r0], %f0;
bar.sync 0;
ld.shared.f32 %f1, [%r1];
Impact: Race condition, non-deterministic results.
P1 High - Performance Bugs
HighRegisterPressure
Problem: >64 registers per thread reduces occupancy.
Register count: 120
Max occupancy: 65536 / (120 * 32) = 17 warps/SM (53%)
Impact: Reduced parallelism, lower throughput.
Fix: Reduce live variables, split kernel, or accept lower occupancy for compute-bound kernels.
PlaceholderCode
Problem: Comments indicate incomplete implementation.
// Detected patterns:
// "simplified"
// "omitted"
// "placeholder"
// "for now"
// "TODO"
Impact: Kernel may produce incorrect results or have missing functionality.
P2 Medium - Optimization Opportunities
DeadCode
Problem: Unreachable code after unconditional branch/return.
// BAD: add.f32 is unreachable
ret;
add.f32 %f0, %f1, %f2;
// BAD: mul.f32 is unreachable
bra skip;
mul.f32 %f0, %f1, %f2;
skip:
Impact: Code bloat, wasted compilation time.
UnoptimizedMemoryPattern
Problem: Multiple single-element loads that could be vectorized.
// BAD: 4 separate loads
ld.global.f32 %f0, [%rd0];
ld.global.f32 %f1, [%rd0+4];
ld.global.f32 %f2, [%rd0+8];
ld.global.f32 %f3, [%rd0+12];
// GOOD: Single vectorized load
ld.global.v4.f32 {%f0, %f1, %f2, %f3}, [%rd0];
Impact: 4x memory bandwidth reduction.
Integration with CI
Add PTX bug detection to your CI pipeline:
# .github/workflows/ptx-analysis.yml
- name: PTX Bug Analysis
run: |
cargo run -p trueno-explain --example deep_bug_hunt
# Fail if any P0 bugs found
cargo test -p trueno-explain --test ptx_bug_hunting
Writing Bug-Free PTX
Use trueno-gpu kernel generators instead of hand-writing PTX:
use trueno_gpu::kernels::{GemmKernel, Kernel};
// Generated PTX is verified bug-free
let kernel = GemmKernel::tiled(1024, 1024, 1024, 32);
let ptx = kernel.emit_ptx();
// Verify with analyzer
let result = PtxBugAnalyzer::new().analyze(&ptx);
assert!(result.is_valid());
API Reference
PtxBugAnalyzer
impl PtxBugAnalyzer {
/// Create default analyzer
pub fn new() -> Self;
/// Create strict mode analyzer
pub fn strict() -> Self;
/// Pre-configured whitelist for quantized kernels
pub fn with_quantized_whitelist() -> Self;
/// Add whitelist entry
pub fn with_whitelist(
self,
kernel_pattern: &str, // e.g., "q4k*"
bug_class: PtxBugClass,
reason: &str
) -> Self;
/// Analyze PTX and return report
pub fn analyze(&self, ptx: &str) -> PtxBugReport;
}
PtxBugReport
impl PtxBugReport {
/// Check if any bugs found
pub fn has_bugs(&self) -> bool;
/// Check for specific bug class
pub fn has_bug(&self, class: &PtxBugClass) -> bool;
/// Check if kernel is valid (no P0/P1 bugs)
pub fn is_valid(&self) -> bool;
/// Count bugs by severity
pub fn count_by_severity(&self, severity: BugSeverity) -> usize;
/// Get formatted report string
pub fn format_report(&self) -> String;
}