Golden Trace Validation

Status: Integrated (v0.7.0)
Tool: Renacer 0.6.2
Purpose: Performance regression detection via syscall tracing

Overview

Golden trace validation uses Renacer, a pure-Rust syscall tracer, to capture canonical execution traces for the Trueno compute examples. These traces serve as performance baselines and enable:

Regression Detection: Detect performance degradation via syscall count/latency budgets
PCIe Bottleneck Analysis: Identify inefficient GPU memory transfers
Build-Time Assertions: Enforce performance contracts in CI/CD
Root Cause Analysis: Correlate syscalls to Rust source code

Quick Start

1. Install Renacer

cargo install renacer --version 0.6.2

2. Capture Golden Traces

cd /path/to/trueno
./scripts/capture_golden_traces.sh

Output:

✅ Captured: golden_traces/backend_detection.json (0.73ms, 87 syscalls)
✅ Captured: golden_traces/matrix_operations.json (1.56ms, 168 syscalls)
✅ Captured: golden_traces/activation_functions.json (1.30ms, 159 syscalls)
✅ Captured: golden_traces/performance_demo.json (1.51ms, 138 syscalls)
✅ Captured: golden_traces/ml_similarity.json (0.82ms, 109 syscalls)

3. View Trace Summary

cat golden_traces/backend_detection_summary.txt

Example Output:

Syscall Summary:
write:     23 calls  (0.15ms total)
mmap:      13 calls  (0.21ms total)
mprotect:   6 calls  (0.08ms total)
munmap:     5 calls  (0.04ms total)
...
TOTAL:     87 calls  (0.73ms total)
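
To check a summary file programmatically, it can be parsed into a dictionary. The following is a minimal sketch that assumes the line layout shown in the example output above (`name: N calls (X.XXms total)`); the helper name `parse_summary` is hypothetical and not part of Renacer.

```python
import re

# Matches lines such as "write:     23 calls  (0.15ms total)".
LINE_RE = re.compile(r"^(\w+):\s+(\d+) calls\s+\(([\d.]+)ms total\)$")


def parse_summary(text):
    """Return {name: (calls, total_ms)} for every syscall line in a summary."""
    rows = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            name, calls, total_ms = match.groups()
            rows[name] = (int(calls), float(total_ms))
    return rows


# Sample copied from the example output above (the elided "..." line is omitted).
SAMPLE = """\
Syscall Summary:
write:     23 calls  (0.15ms total)
mmap:      13 calls  (0.21ms total)
mprotect:   6 calls  (0.08ms total)
munmap:     5 calls  (0.04ms total)
TOTAL:     87 calls  (0.73ms total)
"""

summary = parse_summary(SAMPLE)
total_calls, total_ms = summary.pop("TOTAL")
listed_calls = sum(calls for calls, _ in summary.values())
print(f"{listed_calls} of {total_calls} calls listed, total {total_ms}ms")
```

Header lines and the elided rows do not match the pattern and are skipped, so the listed calls add up to fewer than the total.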

Traced Operations

1. Backend Detection (`backend_detection`)

Purpose: Validate SIMD backend auto-selection (AVX-512 → AVX2 → SSE2 → Scalar)

Performance Budget:

Runtime: <10ms
Syscalls: <100
Memory: <10MB

Actual Performance: ✅

Runtime: 0.73ms (13× faster than budget)
Syscalls: 87
Top syscalls: write (23), mmap (13), mprotect (6)

Trace Capture:

renacer --format json -- ./target/release/examples/backend_detection > backend_detection.json

2. Matrix Operations (`matrix_operations`)

Purpose: Measure SIMD-accelerated matrix multiply and transpose overhead

Performance Budget:

Runtime: <20ms
Syscalls: <200

Actual Performance: ✅

Runtime: 1.56ms (about 13× faster than budget)
Syscalls: 168

Key Insight: SIMD operations are compute-bound (minimal syscalls)

3. ML Activation Functions (`activation_functions`)

Purpose: Measure SIMD-accelerated activations (ReLU, sigmoid, tanh, GELU, swish)

Performance Budget:

Runtime: <20ms
Syscalls: <200

Actual Performance: ✅

Runtime: 1.30ms
Syscalls: 159

4. Performance Demo (`performance_demo`)

Purpose: Comprehensive benchmark across vector ops, matrix ops, and backend comparisons

Performance Budget:

Runtime: <50ms
Syscalls: <300

Actual Performance: ✅

Runtime: 1.51ms (33× faster)
Syscalls: 138

5. ML Similarity (`ml_similarity`)

Purpose: Measure vector similarity operations (cosine, Euclidean, Manhattan)

Performance Budget:

Runtime: <20ms
Syscalls: <200

Actual Performance: ✅ (second-lowest runtime, after `backend_detection` at 0.73ms)

Runtime: 0.82ms
Syscalls: 109

Why Fast: Heavily optimized SIMD dot product, minimal allocations
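
The budgets and measurements in this section can be cross-checked with a few lines of arithmetic. This sketch copies the runtime and syscall numbers from the five examples above; the table layout and the function names `headroom` and `within_budget` are illustrative only.

```python
# name: (runtime budget ms, syscall budget, measured ms, measured syscalls)
BASELINES = {
    "backend_detection":    (10.0, 100, 0.73, 87),
    "matrix_operations":    (20.0, 200, 1.56, 168),
    "activation_functions": (20.0, 200, 1.30, 159),
    "performance_demo":     (50.0, 300, 1.51, 138),
    "ml_similarity":        (20.0, 200, 0.82, 109),
}


def headroom(budget_ms, measured_ms):
    """How many times faster than the budget the measured runtime is."""
    return budget_ms / measured_ms


def within_budget(name):
    """True when both runtime and syscall count are strictly under budget."""
    budget_ms, budget_calls, ms, calls = BASELINES[name]
    return ms < budget_ms and calls < budget_calls


for name, (budget_ms, budget_calls, ms, calls) in BASELINES.items():
    status = "ok" if within_budget(name) else "OVER"
    print(f"{name:22s} {headroom(budget_ms, ms):5.1f}x runtime headroom, "
          f"{calls}/{budget_calls} syscalls  {status}")
```

Every example has at least 12× runtime headroom, while the syscall counts sit much closer to their budgets, so a syscall regression is likely to trip a budget first.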

Performance Assertions (`renacer.toml`)

Critical Path Latency

[[assertion]]
name = "example_startup_latency"
type = "critical_path"
max_duration_ms = 100
fail_on_violation = true
enabled = true

Rationale: Compute examples should complete quickly. 100ms allows for SIMD initialization and small-scale computations.

Violation Symptoms:

SIMD overhead issues
Unexpected I/O operations
Debug build instead of release

Syscall Budget

[[assertion]]
name = "max_syscall_budget"
type = "span_count"
max_spans = 500
fail_on_violation = true
enabled = true

Rationale: SIMD operations are CPU-bound with minimal syscalls (mostly mmap for allocation). Budget prevents I/O regressions.

Typical Syscalls:

write: stdout output (20-50 calls)
mmap: vector/matrix allocation (10-30 calls)
mprotect: memory permissions (5-10 calls)
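
As a quick sanity check, per-syscall counts can be compared against the typical ranges listed above. This is a sketch only: the ranges are copied from this section, and `out_of_range` is a hypothetical helper, not a Renacer feature.

```python
# Typical per-example ranges quoted above: (low, high) call counts.
TYPICAL_RANGES = {
    "write": (20, 50),
    "mmap": (10, 30),
    "mprotect": (5, 10),
}


def out_of_range(counts):
    """Return {name: count} for syscalls whose count is outside its typical range."""
    flagged = {}
    for name, (low, high) in TYPICAL_RANGES.items():
        count = counts.get(name, 0)
        if not low <= count <= high:
            flagged[name] = count
    return flagged


# Counts from the backend_detection summary in the Quick Start.
print(out_of_range({"write": 23, "mmap": 13, "mprotect": 6}))   # {}
# A hypothetical run with excessive stdout output.
print(out_of_range({"write": 140, "mmap": 13, "mprotect": 6}))  # {'write': 140}
```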

Memory Allocation Budget

[[assertion]]
name = "memory_allocation_budget"
type = "memory_usage"
max_bytes = 104857600  # 100MB
tracking_mode = "allocations"
fail_on_violation = true
enabled = true

Rationale: Small examples should have minimal memory footprint. 100MB accommodates matrix allocations and SIMD buffers.

PCIe Bottleneck Detection

[[assertion]]
name = "detect_pcie_bottleneck"
type = "anti_pattern"
pattern = "PCIeBottleneck"
threshold = 0.7
fail_on_violation = false  # Warning only
enabled = true

Pattern Detected: GPU transfer time >> compute time

Symptoms:

Many write/read syscalls to /dev/nvidia*
High ioctl frequency for GPU operations
Transfer overhead dominates (>70% of total time)

Example Warning:

⚠️  PCIe Bottleneck detected (confidence: 85%)
   GPU transfers: 45ms (90% of total time)
   Compute time:   5ms (10% of total time)
   Recommendation: Batch operations, keep data on GPU

Solution:

Batch multiple operations
Keep intermediate results on GPU
Use larger workloads (amortize transfer costs)
Trueno automatically disables GPU for small ops (v0.2.1+)
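
The 0.7 threshold can be read as a simple ratio test on transfer time versus total time. The sketch below illustrates that reading with the numbers from the example warning; it is an assumption about what the threshold means, not Renacer's actual detection logic, and both function names are hypothetical.

```python
def transfer_fraction(transfer_ms, compute_ms):
    """Share of total time spent on GPU transfers (0.0 when nothing was measured)."""
    total = transfer_ms + compute_ms
    return transfer_ms / total if total > 0 else 0.0


def is_pcie_bound(transfer_ms, compute_ms, threshold=0.7):
    """True when transfers take more than `threshold` of the total time."""
    return transfer_fraction(transfer_ms, compute_ms) > threshold


print(transfer_fraction(45.0, 5.0))  # 0.9
print(is_pcie_bound(45.0, 5.0))      # True
print(is_pcie_bound(5.0, 45.0))      # False
```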

CI/CD Integration

GitHub Actions Workflow

Add to .github/workflows/ci.yml:

name: Golden Trace Validation

on: [push, pull_request]

jobs:
  validate-traces:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install Rust
        uses: dtolnay/rust-toolchain@stable

      - name: Install Renacer
        run: cargo install renacer --version 0.6.2

      - name: Build Examples (Release)
        run: cargo build --release --examples

      - name: Capture Golden Traces
        run: ./scripts/capture_golden_traces.sh

      - name: Run Performance Assertions
        run: |
          renacer --assert renacer.toml -- ./target/release/examples/backend_detection
          renacer --assert renacer.toml -- ./target/release/examples/matrix_operations
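          renacer --assert renacer.toml -- ./target/release/examples/activation_functions
          renacer --assert renacer.toml -- ./target/release/examples/performance_demo
          renacer --assert renacer.toml -- ./target/release/examples/ml_similarity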

      - name: Upload Traces
        uses: actions/upload-artifact@v4
        with:
          name: golden-traces
          path: golden_traces/

CI Failure Example:

❌ Assertion 'example_startup_latency' FAILED
   Actual: 125ms
   Budget: 100ms
   Regression: +25%

⚠️  Build BLOCKED. SIMD overhead regression detected.
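
Note that the percentage in this example is measured against the budget, not against the golden baseline. A one-line sketch of that arithmetic, with a hypothetical helper name:

```python
def over_budget_percent(actual_ms, budget_ms):
    """Percentage by which the measured value exceeds (or undercuts) the budget."""
    return (actual_ms - budget_ms) / budget_ms * 100.0


print(f"{over_budget_percent(125, 100):+.0f}%")  # +25%
```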

Advanced Usage

1. Source Code Correlation

Map syscalls to Rust source code:

renacer -s -- ./target/release/examples/backend_detection

Output:

write(1, "Backend: AVX2\n", 14) = 14  [src/lib.rs:245]
mmap(...) = 0x7f... [src/vector.rs:89]

Use Case: Identify which code paths trigger GPU initialization or excessive allocations.

2. OpenTelemetry Export

Visualize traces in Jaeger:

# Start Jaeger
docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 4317:4317 \
  -p 16686:16686 \
  jaegertracing/all-in-one:latest

# Export trace
renacer --otlp http://localhost:4317 -- ./target/release/examples/performance_demo

# View in Jaeger UI
open http://localhost:16686

Use Case: Visualize syscall timelines for multi-operation pipelines.

3. Regression Analysis

Compare current execution against baseline:

# Capture current trace
renacer --format json -- ./target/release/examples/backend_detection > current.json

# Compare with golden
diff <(jq '.syscalls | length' golden_traces/backend_detection.json) \
     <(jq '.syscalls | length' current.json)

Expected: The two counts match. Note that diff reports any difference at all, so applying the ±5% tolerance requires comparing the two counts numerically.
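
A tolerance-aware comparison could look like the following sketch. It assumes, as the jq commands above do, that each trace has a top-level `syscalls` array; the function names are hypothetical, and inline data stands in for the two JSON files.

```python
import json


def load_trace(path):
    """Read a JSON trace file, e.g. golden_traces/backend_detection.json."""
    with open(path) as handle:
        return json.load(handle)


def syscall_count(trace):
    """Number of entries in the trace's top-level "syscalls" array."""
    return len(trace["syscalls"])


def within_tolerance(golden_count, current_count, tolerance=0.05):
    """True when current_count deviates from golden_count by at most `tolerance`."""
    if golden_count == 0:
        return current_count == 0
    return abs(current_count - golden_count) / golden_count <= tolerance


# Inline stand-ins for the golden and current traces (87 vs 90 syscalls).
golden = {"syscalls": [{"name": "write"}] * 87}
current = {"syscalls": [{"name": "write"}] * 90}
print(within_tolerance(syscall_count(golden), syscall_count(current)))  # True
```

With a golden count of 87, the 5% band allows a difference of up to 4 syscalls in either direction.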

4. GPU Workload Analysis

For GPU-enabled builds:

# Build with GPU feature
cargo build --release --examples --features gpu

# Trace GPU example
renacer --format json -- ./target/release/examples/gpu_test > gpu_trace.json

# Filter GPU device operations
jq '.syscalls[] | select(.name == "ioctl" or .name == "write")' gpu_trace.json

Expected: GPU operations show as ioctl calls to /dev/nvidia0

Red Flag: If transfer syscalls dominate, the GPU is inefficient for this workload size.
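
To put a number on "dominate", the share of GPU-related syscalls in a trace can be computed. This sketch reuses the `name` field and the two syscall names from the jq filter above; call share is only a rough proxy for time share, and `write` also covers ordinary stdout output. The helper name and sample data are hypothetical.

```python
from collections import Counter

GPU_RELATED = {"ioctl", "write"}  # same names as the jq filter above


def syscall_share(syscalls, names=GPU_RELATED):
    """Fraction of trace entries whose "name" is in `names`."""
    if not syscalls:
        return 0.0
    counts = Counter(entry["name"] for entry in syscalls)
    return sum(counts[name] for name in names) / len(syscalls)


# Hypothetical trace entries: 6 ioctl, 2 write, 2 mmap.
sample = [{"name": "ioctl"}] * 6 + [{"name": "write"}] * 2 + [{"name": "mmap"}] * 2
print(syscall_share(sample))  # 0.8
```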

Toyota Way Principles

Andon (Stop the Line)

Implementation: Build-time assertions fail CI on regression

[[assertion]]
fail_on_violation = true  # ← Andon: Stop the CI pipeline

Poka-Yoke (Error-Proofing)

Implementation: Golden traces make expected patterns explicit

# Automated comparison prevents silent regressions
diff golden_traces/backend_detection.json new_trace.json

Jidoka (Autonomation)

Implementation: Automated quality enforcement without manual intervention

# GitHub Actions runs golden trace validation automatically
- name: Validate Performance
  run: ./scripts/capture_golden_traces.sh

Troubleshooting

Issue: Capture script fails with "Binary not found"

Solution:

cargo build --release --examples
./scripts/capture_golden_traces.sh

Issue: Performance regression detected

Diagnosis:

renacer --summary --timing -- ./target/release/examples/backend_detection
cat golden_traces/backend_detection_summary.txt

Common Causes:

Debug build instead of release (rebuild with cargo build --release)
SIMD features disabled (check RUSTFLAGS)
New dependencies (increase initialization overhead)

Issue: Syscall count regression

Diagnosis:

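# Capture a summary in the same form as the golden file; timings vary between runs, so focus on the call counts
renacer --summary --timing -- ./target/release/examples/backend_detection > current_summary.txt
diff current_summary.txt golden_traces/backend_detection_summary.txt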

Common Causes:

New logging initialization (env_logger, tracing)
Allocator changes (jemalloc → system allocator)
Library updates (different I/O patterns)
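
To narrow down which of these causes applies, it helps to see which syscalls changed and by how much. The sketch below compares two name-to-count mappings; the helper name and the "current" counts are hypothetical, while the golden counts come from the summary in the Quick Start.

```python
def count_deltas(golden, current):
    """Return {name: current - golden} for changed syscalls, largest change first."""
    names = set(golden) | set(current)
    deltas = {name: current.get(name, 0) - golden.get(name, 0) for name in names}
    ordered = sorted(deltas.items(), key=lambda item: (-abs(item[1]), item[0]))
    return {name: delta for name, delta in ordered if delta != 0}


golden = {"write": 23, "mmap": 13, "mprotect": 6, "munmap": 5}
# Hypothetical regression: extra allocations plus new file opens.
current = {"write": 23, "mmap": 19, "mprotect": 6, "munmap": 5, "openat": 4}
print(count_deltas(golden, current))  # {'mmap': 6, 'openat': 4}
```

A syscall that appears only in the current run, such as `openat` here, often points to new initialization code like a logger or configuration loader.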

Performance Baselines (v0.7.0)

Example	Runtime	Syscalls	Top Syscall	Status
`backend_detection`	0.73ms	87	`write` (23)	✅
`matrix_operations`	1.56ms	168	`write` (45)	✅
`activation_functions`	1.30ms	159	`write` (38)	✅
`performance_demo`	1.51ms	138	`mmap` (25)	✅
`ml_similarity`	0.82ms	109	`write` (28)	✅

Platform: x86_64 Linux, AVX2 backend, Release build

Last Updated: 2025-11-23
Renacer Version: 0.6.2
Trueno Version: 0.7.0

Trueno - High-Performance SIMD/GPU Compute Library