PTX Best Practices
This document covers PTX assembly generation best practices learned while developing and debugging trueno-gpu CUDA kernels.
Register Types
U8 Registers Are Not Supported
Issue: PTX does not support 8-bit register types (.u8, .s8).
Incorrect:
.reg .u8 %rs<1>; // ERROR: Invalid register type
ld.global.u8 %rs0, [%rd0];
Correct:
.reg .u16 %rh<1>; // Minimum register size is 16-bit
ld.global.u8 %rh0, [%rd0]; // Load zero-extends to 16-bit
The ld.global.u8 instruction is valid, but it must store into a 16-bit
or larger register. The loaded byte is zero-extended.
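For reference, here is a minimal self-contained kernel built around this pattern; the kernel and parameter names (widen_u8_sketch, in_ptr, out_ptr) are illustrative rather than trueno-gpu code, and it should assemble with the ptxas command shown under Testing PTX:
.version 7.8
.target sm_89
.address_size 64

.visible .entry widen_u8_sketch(
    .param .u64 in_ptr,
    .param .u64 out_ptr
)
{
    .reg .u16 %rh<1>;
    .reg .u32 %r<1>;
    .reg .u64 %rd<2>;

    ld.param.u64 %rd0, [in_ptr];
    ld.param.u64 %rd1, [out_ptr];
    cvta.to.global.u64 %rd0, %rd0;
    cvta.to.global.u64 %rd1, %rd1;

    ld.global.u8 %rh0, [%rd0];   // Byte load, zero-extended into a 16-bit register
    cvt.u32.u16 %r0, %rh0;       // Widen to 32 bits for further arithmetic
    st.global.u32 [%rd1], %r0;
    ret;
}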
Half-Precision (F16) Operations
Loading F16 Values
Issue: PTX uses .b16 (binary 16-bit) for half-precision loads, not .f16.
Incorrect:
ld.global.f16 %h0, [%rd0]; // ERROR: Invalid type for load
Correct:
ld.global.b16 %h0, [%rd0]; // Load 16-bit binary value
F16 to F32 Conversion
Issue: Converting from f16 to f32 is exact, so the cvt must NOT carry a rounding modifier.
Incorrect:
cvt.rn.f32.f16 %f0, %h0; // ERROR: Illegal rounding modifier
Correct:
cvt.f32.f16 %f0, %h0; // No rounding needed (exact conversion)
Note: The reverse conversion (f32 → f16) DOES require a rounding modifier:
cvt.rn.f16.f32 %h0, %f0; // Correct: rounding needed for narrowing
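Putting both rules together, here is a minimal sketch of a kernel that loads an f16 value, computes in f32, and stores the result back as f16; the kernel and parameter names (f16_roundtrip_sketch, in_ptr, out_ptr) are illustrative, not trueno-gpu code:
.version 7.8
.target sm_89
.address_size 64

.visible .entry f16_roundtrip_sketch(
    .param .u64 in_ptr,
    .param .u64 out_ptr
)
{
    .reg .b16 %h<1>;
    .reg .f32 %f<1>;
    .reg .u64 %rd<2>;

    ld.param.u64 %rd0, [in_ptr];
    ld.param.u64 %rd1, [out_ptr];
    cvta.to.global.u64 %rd0, %rd0;
    cvta.to.global.u64 %rd1, %rd1;

    ld.global.b16 %h0, [%rd0];     // f16 bits loaded as .b16
    cvt.f32.f16 %f0, %h0;          // Widening conversion: no rounding modifier
    add.f32 %f0, %f0, %f0;         // Compute in f32 (here: double the value)
    cvt.rn.f16.f32 %h0, %f0;       // Narrowing conversion: rounding modifier required
    st.global.b16 [%rd1], %h0;     // Store the 16-bit pattern back
    ret;
}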
Bitwise Operations
AND, OR, XOR Types
Issue: PTX requires .b32 (binary) type for bitwise operations, not .u32.
Incorrect:
and.u32 %r2, %r0, %r1; // ERROR: Invalid type
or.u32 %r2, %r0, %r1; // ERROR: Invalid type
Correct:
and.b32 %r2, %r0, %r1; // Use .b32 for bitwise ops
or.b32 %r2, %r0, %r1;
xor.b32 %r2, %r0, %r1;
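As a typical use case, here is a short sketch that unpacks two 4-bit values from the low byte of %r0; the register names are placeholders assumed to be declared elsewhere:
and.b32 %r1, %r0, 0x0F;      // Low nibble
shr.b32 %r2, %r0, 4;         // Shift the high nibble down
and.b32 %r2, %r2, 0x0F;      // High nibble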
Warp Shuffle Operations
Shuffle Width Parameter
Issue: The width parameter in shfl.sync.idx must be a power of 2 (1, 2, 4, 8, 16, or 32).
Incorrect:
shfl.sync.idx.b32 %f0, %f1, 0, 31, 0xFFFFFFFF; // ERROR: 31 is not a power of 2
Correct:
shfl.sync.idx.b32 %f0, %f1, 0, 32, 0xFFFFFFFF; // 32 is valid
Warp Participation
Issue: shfl.sync with mask 0xFFFFFFFF requires ALL 32 threads in the warp
to execute the instruction together. If some threads exit early (e.g., via
@%p bra exit), the behavior of the remaining threads' shuffles is undefined.
Solution: Use address clamping to ensure all threads access valid memory, then skip only the final store for out-of-bounds threads:
// Clamp addresses for all threads
min.u32 %r_clamped_row, %r_global_row, %r_m_minus_1;
min.u32 %r_clamped_col, %r_global_col, %r_n_minus_1;
// Set out-of-bounds predicates up front (%r_m and %r_n are placeholders holding M and N)
setp.ge.u32 %p_row_oob, %r_global_row, %r_m;
setp.ge.u32 %p_col_oob, %r_global_col, %r_n;
// All threads participate in computation and shuffles
// ...shuffle reduction code...
// Only in-bounds threads store
@%p_row_oob bra exit;
@%p_col_oob bra exit;
st.global.f32 [%rd_out], %f_result;
exit:
ret;
Memory Alignment
4-Byte Alignment for U32 Loads
Issue: ld.global.u32 requires the address to be 4-byte aligned.
Incorrect:
// If header has 2-byte f16 scale at offset 0, and we try to read
// another u32 at offset 2, it will be misaligned
add.u64 %rd1, %rd0, 2;
ld.global.u32 %r0, [%rd1]; // ERROR: Misaligned access
Correct: Use smaller loads for misaligned data:
ld.global.b16 %rh0, [%rd0]; // 16-bit loads only need 2-byte alignment
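If the misaligned 32-bit field itself is needed (the 4 bytes starting at offset 2 in the layout above), it can be rebuilt from two naturally aligned 16-bit loads. A sketch, assuming %rd0 points at the header and the registers are declared elsewhere:
ld.global.u16 %rh0, [%rd0+2];   // Low 16 bits (offset 2 is 2-byte aligned)
ld.global.u16 %rh1, [%rd0+4];   // High 16 bits (offset 4 is 2-byte aligned)
cvt.u32.u16 %r0, %rh0;
cvt.u32.u16 %r1, %rh1;
shl.b32 %r1, %r1, 16;
or.b32  %r0, %r0, %r1;          // Reassembled 32-bit value (little-endian)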
Testing PTX
Always validate generated PTX with ptxas:
ptxas -arch=sm_89 -v kernel.ptx -o kernel.cubin
Use compute-sanitizer for runtime memory access checking:
compute-sanitizer --tool memcheck ./your_program
References
- PTX ISA Reference
- GitHub Issue #67 - U8 register bug
- GitHub Issue #68 - F16 load/convert bug