PTX Best Practices

This document covers PTX assembly generation best practices learned while developing and debugging trueno-gpu CUDA kernels.

Register Types

U8 Registers Are Not Supported

Issue: PTX does not support 8-bit register types (.u8, .s8); the smallest register width is 16 bits.

Incorrect:

.reg .u8 %rs<1>;  // ERROR: Invalid register type
ld.global.u8 %rs0, [%rd0];

Correct:

.reg .u16 %rh<1>;  // Minimum register size is 16-bit
ld.global.u8 %rh0, [%rd0];  // Load zero-extends to 16-bit

The ld.global.u8 instruction itself is valid, but its destination must be a 16-bit or larger register; the loaded byte is zero-extended to fill it.
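For example, a quantized-weight load typically widens the zero-extended byte to f32 before applying a scale. A minimal sketch (register names are illustrative, and the scale is assumed to already be in %f1):

```ptx
.reg .u16 %rh<1>;
.reg .f32 %f<3>;

ld.global.u8 %rh0, [%rd0];   // byte zero-extended into 16-bit register
cvt.rn.f32.u16 %f0, %rh0;    // integer-to-float cvt requires a rounding modifier
mul.f32 %f2, %f0, %f1;       // apply dequantization scale
```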

Half-Precision (F16) Operations

Loading F16 Values

Issue: PTX uses .b16 (binary 16-bit) for half-precision loads, not .f16.

Incorrect:

ld.global.f16 %h0, [%rd0];  // ERROR: Invalid type for load

Correct:

ld.global.b16 %h0, [%rd0];  // Load 16-bit binary value

F16 to F32 Conversion

Issue: Converting from f16 to f32 is exact and does NOT require a rounding modifier.

Incorrect:

cvt.rn.f32.f16 %f0, %h0;  // ERROR: Illegal rounding modifier

Correct:

cvt.f32.f16 %f0, %h0;  // No rounding needed (exact conversion)

Note: The reverse conversion (f32 → f16) DOES require a rounding modifier:

cvt.rn.f16.f32 %h0, %f0;  // Correct: rounding needed for narrowing
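Putting both directions together, a typical f16 compute path loads raw 16-bit values, widens them exactly, computes in f32, and narrows the result with rounding. A sketch (register names are illustrative):

```ptx
.reg .b16 %h<2>;
.reg .f32 %f<3>;

ld.global.b16 %h0, [%rd0];   // raw half bits
cvt.f32.f16 %f0, %h0;        // exact widening: no rounding modifier
add.f32 %f2, %f0, %f1;       // compute in f32
cvt.rn.f16.f32 %h1, %f2;     // narrowing: rounding modifier required
st.global.b16 [%rd1], %h1;   // store raw half bits
```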

Bitwise Operations

AND, OR, XOR Types

Issue: PTX requires .b32 (binary) type for bitwise operations, not .u32.

Incorrect:

and.u32 %r2, %r0, %r1;  // ERROR: Invalid type
or.u32 %r2, %r0, %r1;   // ERROR: Invalid type

Correct:

and.b32 %r2, %r0, %r1;  // Use .b32 for bitwise ops
or.b32 %r2, %r0, %r1;
xor.b32 %r2, %r0, %r1;
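Bitwise .b32 operations combine naturally with shifts, for example when unpacking two 4-bit quantized values from one byte. A sketch, assuming the byte has already been loaded and widened into %r0:

```ptx
and.b32 %r1, %r0, 0x0F;   // low nibble
shr.u32 %r2, %r0, 4;      // shift high nibble down (unsigned shift)
and.b32 %r2, %r2, 0x0F;   // mask to 4 bits
```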

Warp Shuffle Operations

Shuffle Width Parameter

Issue: The width parameter in shfl.sync.idx must be a power of 2 (1, 2, 4, 8, 16, or 32).

Incorrect:

shfl.sync.idx.b32 %f0, %f1, 0, 31, 0xFFFFFFFF;  // ERROR: 31 is not power of 2

Correct:

shfl.sync.idx.b32 %f0, %f1, 0, 32, 0xFFFFFFFF;  // 32 is valid
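Power-of-2 offsets also enable a full-warp sum reduction via butterfly shuffles. A sketch using the same operand convention as the example above (register names are illustrative):

```ptx
shfl.sync.bfly.b32 %f1, %f0, 16, 32, 0xFFFFFFFF;
add.f32 %f0, %f0, %f1;
shfl.sync.bfly.b32 %f1, %f0, 8, 32, 0xFFFFFFFF;
add.f32 %f0, %f0, %f1;
shfl.sync.bfly.b32 %f1, %f0, 4, 32, 0xFFFFFFFF;
add.f32 %f0, %f0, %f1;
shfl.sync.bfly.b32 %f1, %f0, 2, 32, 0xFFFFFFFF;
add.f32 %f0, %f0, %f1;
shfl.sync.bfly.b32 %f1, %f0, 1, 32, 0xFFFFFFFF;
add.f32 %f0, %f0, %f1;   // every lane now holds the warp-wide sum
```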

Warp Participation

Issue: shfl.sync with mask 0xFFFFFFFF requires all 32 threads in the warp to execute the instruction convergently; if any thread named in the mask does not reach it, the behavior is undefined.

If some threads exit early (e.g., via a predicated bra to an exit label), the remaining threads cannot safely perform shuffles.

Solution: Use address clamping to ensure all threads access valid memory, then skip only the final store for out-of-bounds threads:

// Clamp addresses for all threads
min.u32 %r_clamped_row, %r_global_row, %r_m_minus_1;
min.u32 %r_clamped_col, %r_global_col, %r_n_minus_1;

// All threads participate in computation and shuffles
// ...shuffle reduction code...

// Only in-bounds threads store
@%p_row_oob bra $L_done;
@%p_col_oob bra $L_done;
st.global.f32 [%rd_out], %f_result;
$L_done:
ret;

Memory Alignment

4-Byte Alignment for U32 Loads

Issue: ld.global.u32 requires the address to be 4-byte aligned.

Incorrect:

// If header has 2-byte f16 scale at offset 0, and we try to read
// another u32 at offset 2, it will be misaligned
add.u64 %rd1, %rd0, 2;
ld.global.u32 %r0, [%rd1];  // ERROR: Misaligned access

Correct: Use loads that match the data's natural alignment:

ld.global.b16 %rh0, [%rd0];  // 2-byte load of the f16 scale at offset 0
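If the full 32-bit value at a 2-byte-aligned offset is still needed, it can be assembled from two aligned 16-bit loads. A sketch, assuming %rd1 holds the 2-byte-aligned address and the data is little-endian (register names are illustrative):

```ptx
.reg .u16 %rh<2>;
.reg .b32 %r<3>;

ld.global.u16 %rh0, [%rd1];     // low half
ld.global.u16 %rh1, [%rd1+2];   // high half
cvt.u32.u16 %r0, %rh0;
cvt.u32.u16 %r1, %rh1;
shl.b32 %r1, %r1, 16;
or.b32  %r2, %r0, %r1;          // reconstructed u32
```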

Testing PTX

Always validate generated PTX with ptxas (the -v flag reports per-kernel register and memory usage):

ptxas -arch=sm_89 -v kernel.ptx -o kernel.cubin

Use compute-sanitizer for runtime memory access checking:

compute-sanitizer --tool memcheck ./your_program
