
20. Memory Profiling with dhat-rs

Status: COMPLETE (ALB-099 FIXED). dhat-rs integrated in all 5 repos. 21 memory issues found and fixed across 4 profiling rounds.

0. Motivation & Results

The sovereign stack originally had zero heap profiling instrumentation. dhat-rs profiling (ALB-099) uncovered 21 issues across 5 repos:

Round 1 (binary workloads):

  1. realizar: std::fs::read on 17 GB model → peak 19.6 GB. Fixed: early-out for GPU Q4K → 1.3 GB (93% reduction)
  2. realizar: PMAT-045 variable scoping bug (compile error)
  3. aprender: validate_gguf dequantized every tensor to f32 → 8.2 GB. Fixed: metadata-only → 6.0 GB
  4. entrenar: TRACER accumulated unbounded Vec → ~2.8 GB at 28K steps. Fixed: aggregated HashMap → O(1) memory
  5. alimentar: cmd_info loaded entire dataset → 73 MB. Fixed: Parquet footer → 0.2 MB (348x)
  6. trueno: TunerFeatures::to_vector() → 140K tiny Vec allocs. Fixed: to_array() → zero alloc
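The to_vector() → to_array() fix (item 6) can be sketched as follows. The struct fields below are hypothetical stand-ins, not trueno's actual TunerFeatures layout:

```rust
// Hypothetical stand-in for trueno's TunerFeatures; field names are invented.
struct TunerFeatures {
    m: f32,
    n: f32,
    k: f32,
}

impl TunerFeatures {
    // Before: one heap allocation (Vec buffer) per call — 140K calls, 140K allocs.
    fn to_vector(&self) -> Vec<f32> {
        vec![self.m, self.n, self.k]
    }

    // After: fixed-size array on the stack, zero heap allocations.
    fn to_array(&self) -> [f32; 3] {
        [self.m, self.n, self.k]
    }
}

fn main() {
    let f = TunerFeatures { m: 128.0, n: 64.0, k: 32.0 };
    // Same values, no allocation in the array variant.
    assert_eq!(f.to_vector(), f.to_array().to_vec());
}
```

Because the feature count is fixed at compile time, returning `[f32; N]` moves the data entirely onto the stack; callers that need a slice can still borrow it with `&f.to_array()[..]`.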

Round 2 (deeper workloads):

  7. trueno: BLIS gemm_blis() allocated 4.3 MB workspace per call. Fixed: thread-local high-water-mark → zero alloc after first call
  8. alimentar: Unique::apply() O(rows×cols) String allocs. Fixed: u64 hash keys → zero String allocs
  9. aprender+entrenar: APR checkpoint save copied 450 MB weights twice. Fixed: consuming pipeline → 900 MB eliminated
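The thread-local high-water-mark pattern from the gemm_blis() fix can be sketched like this. This is a generic illustration of the technique, not trueno's actual workspace code:

```rust
use std::cell::RefCell;

thread_local! {
    // Grow-only scratch buffer: reused across calls on the same thread.
    static WORKSPACE: RefCell<Vec<f32>> = RefCell::new(Vec::new());
}

// High-water-mark pattern (sketch, not trueno's actual gemm_blis workspace):
// the first call with a given size pays one allocation; later calls of equal
// or smaller size reuse the buffer with zero allocations.
fn with_workspace<R>(len: usize, f: impl FnOnce(&mut [f32]) -> R) -> R {
    WORKSPACE.with(|ws| {
        let mut buf = ws.borrow_mut();
        if buf.len() < len {
            buf.resize(len, 0.0); // grows to the high-water mark, never shrinks
        }
        f(&mut buf[..len])
    })
}

fn main() {
    // First call allocates; this second-style usage pattern reuses the buffer.
    let sum = with_workspace(1024, |scratch| {
        scratch
            .iter_mut()
            .enumerate()
            .for_each(|(i, x)| *x = i as f32);
        scratch.iter().sum::<f32>()
    });
    println!("sum = {sum}");
}
```

The trade-off is that each thread holds its peak workspace size for its lifetime, which is usually acceptable for a bounded pool of compute threads.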

Round 3 (code analysis):

  10. aprender: resolve_f32_tied_embeddings cloned BTreeMap (~1.4 GB). Fixed: Cow → only clones when needed
  11. aprender: AprV2Writer output Vec no capacity hint. Fixed: with_capacity(total_size)
  12. entrenar: ArrowDataset triple materialization (~1 GB/shard). Fixed: scoped drop
  13. entrenar: 3 Vec allocations without with_capacity in hot path. Fixed: capacity hints
  14. realizar: detect_model() read entire file for 8-byte magic. Fixed: File::open + read(8) → ~0 bytes
  15. realizar: run_inference() read entire file for format detection. Fixed: read_exact(8) → ~0 bytes
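The format-detection fixes (items 14-15) amount to reading only the file's magic bytes instead of the whole file. A minimal sketch, with an illustrative magic value and temp-file path rather than realizar's actual detection logic:

```rust
use std::fs::File;
use std::io::Read;

// Read only the first 8 bytes for format detection, instead of loading the
// entire (possibly multi-GB) model file into memory.
fn read_magic(path: &std::path::Path) -> std::io::Result<[u8; 8]> {
    let mut magic = [0u8; 8];
    File::open(path)?.read_exact(&mut magic)?;
    Ok(magic)
}

fn main() -> std::io::Result<()> {
    // Illustrative: write a small file whose first 8 bytes are a fake magic.
    let path = std::env::temp_dir().join("magic_demo.bin");
    std::fs::write(&path, b"GGUF\x03\x00\x00\x00rest-of-file")?;

    let magic = read_magic(&path)?;
    assert_eq!(&magic[..4], b"GGUF");
    println!("magic = {:?}", magic);
    Ok(())
}
```

Heap impact drops from O(file size) to a fixed 8-byte stack buffer, which is why the table below reports "~0 bytes" for these call sites.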

Round 4 (architectural issues — ALB-100 through ALB-105):

  16. entrenar: LMBatch stored input+target separately → 50% waste. Fixed: shared stride-based tokens (ALB-100)
  17. entrenar: Eager dataset loading → 37 GB for 19 shards. Fixed: StreamingParquetLoader (ALB-101)
  18. realizar: KV cache used hidden_dim not kv_dim → 4-8x over-allocation. Fixed: GQA-aware sizing (ALB-102)
  19. realizar: sample_topk allocated 2.4 MB per token. Fixed: in-place masking + O(n) partial sort (ALB-103)
  20. aprender: APR reader re-parsed header per tensor read. Fixed: cached data_offset (ALB-104)
  21. aprender: APR write_into() buffered entire file in RAM. Fixed: AprV2StreamingWriter (ALB-105)
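The arithmetic behind the KV cache fix (ALB-102) is worth spelling out. Under grouped-query attention (GQA), keys and values have kv_dim = n_kv_heads × head_dim, not hidden_dim = n_heads × head_dim, so sizing the cache on hidden_dim over-allocates by the n_heads/n_kv_heads ratio. The head counts below are illustrative, not realizar's actual model config:

```rust
// Sketch of GQA-aware KV cache sizing (ALB-102). Dimensions are illustrative.
fn kv_cache_bytes(n_layers: usize, seq_len: usize, dim: usize, bytes_per_elem: usize) -> usize {
    // One K cache and one V cache per layer.
    2 * n_layers * seq_len * dim * bytes_per_elem
}

fn main() {
    let (n_heads, n_kv_heads, head_dim) = (32, 4, 128); // 8:1 GQA ratio
    let hidden_dim = n_heads * head_dim; // 4096
    let kv_dim = n_kv_heads * head_dim;  // 512

    let wrong = kv_cache_bytes(32, 4096, hidden_dim, 2); // sized on hidden_dim
    let right = kv_cache_bytes(32, 4096, kv_dim, 2);     // sized on kv_dim
    println!("over-allocation factor: {}x", wrong / right); // 8x for this config
}
```

The 4-8x figure in the fix list corresponds to the typical n_heads/n_kv_heads ratios of modern GQA models.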

1. Tool Selection: dhat-rs

dhat-rs (crate: dhat, version 0.3) is a Rust heap profiler that:

  • Tracks every allocation (count, size, lifetime)
  • Reports peak heap usage, total bytes allocated, allocation hot spots
  • Produces DHAT-compatible JSON for visualization in dh_view.html
  • Has ~2× runtime overhead (acceptable for profiling builds, not production)
  • Zero overhead when disabled (feature-gated, compiles to nothing)

Why not alternatives:

| Tool | Rejection Reason |
|------|------------------|
| Valgrind DHAT | External tool, not composable with cargo test |
| jemalloc + MALLOC_CONF | Non-portable, Linux-only, coarse-grained |
| peak_alloc | Only tracks peak RSS, no per-allocation detail |
| tikv-jemallocator + stats | Heavy dependency, overkill for profiling |

2. Integration Architecture

2.1 Feature Flag Convention

Every repo uses the same feature flag: dhat-heap.

[features]
dhat-heap = ["dep:dhat"]

[dependencies]
dhat = { version = "0.3", optional = true }

The flag name dhat-heap follows dhat-rs convention and is already recognized by the crate’s #[cfg(feature = "dhat-heap")] guards.

2.2 Global Allocator Pattern

In each binary crate’s main.rs (or lib.rs for libraries with integration tests):

#[cfg(feature = "dhat-heap")]
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

The profiler is activated at program start:

fn main() {
    #[cfg(feature = "dhat-heap")]
    let _profiler = dhat::Profiler::new_heap();

    // ... existing main logic ...
}

When the _profiler drops (at program exit), dhat-rs writes:

  • Summary to stderr (peak heap, total bytes, allocation count)
  • dhat-heap.json to CWD (viewable in dh_view.html)

2.3 Per-Repo Integration Points

| Repo | Crate | Entry Point | Purpose |
|------|-------|-------------|---------|
| realizar | realizar (bin) | src/main.rs | Inference memory: model load, KV cache, serve |
| aprender | apr-cli (bin) | crates/apr-cli/src/main.rs | CLI memory: import, quantize, eval |
| trueno | trueno (lib) | src/lib.rs + integration tests | Tensor op allocations |
| entrenar | entrenar (bin) | src/main.rs | Training memory: buffers, optimizer, checkpoints |

2.4 Integration Test Pattern

For library crates (trueno), use dhat’s assertion API in tests:

#[cfg(test)]
#[cfg(feature = "dhat-heap")]
mod dhat_tests {
    use super::*;

    #[test]
    fn test_vector_add_allocations() {
        let _profiler = dhat::Profiler::builder().testing().build();
        // ... exercise code ...
        let stats = dhat::HeapStats::get();
        // Assert allocation bounds
        dhat::assert!(stats.total_bytes < 1_000_000);
    }
}

3. Usage

3.1 Profile a Binary

# Profile realizar inference
cd ~/src/realizar
CARGO_TARGET_DIR=/mnt/nvme-raid0/targets/realizar \
  cargo run --release --features dhat-heap -- \
  run /mnt/nvme-raid0/models/qwen3-coder-30b-q4k.apr "def fib(n):" --gpu --max-tokens 20

# Profile entrenar training
cd ~/src/entrenar
CARGO_TARGET_DIR=/mnt/nvme-raid0/targets/entrenar \
  cargo run --release --features "dhat-heap,cuda,parquet" -- \
  train --config configs/pretrain-50m-quick.yaml

# Profile apr CLI
cd ~/src/aprender
CARGO_TARGET_DIR=/mnt/nvme-raid0/targets/aprender \
  cargo run --release --features dhat-heap -p apr-cli -- \
  tensors info /mnt/nvme-raid0/models/qwen3-coder-30b-q4k.apr

3.2 View Results

# dhat-rs writes dhat-heap.json on exit
# Open in browser: https://nnethercote.github.io/dh_view/dh_view.html
# Load dhat-heap.json → tree view of allocation sites

3.3 Run dhat-Gated Tests

# Run tests with allocation assertions
cargo test --features dhat-heap

4. Contracts

Contract: contracts/memory-profiling-v1.yaml (ALB-099)

4.1 Obligations

| ID | Description | Falsification |
|----|-------------|---------------|
| C-DHAT-001 | All 4 repos compile with --features dhat-heap | cargo check --features dhat-heap exits 0 |
| C-DHAT-002 | Binary crates produce dhat-heap.json on exit | Run binary, assert file exists |
| C-DHAT-003 | Zero overhead when feature disabled | cargo build --release size unchanged ±1% |
| C-DHAT-004 | Feature flag is dhat-heap in all repos | grep 'dhat-heap' Cargo.toml in all 4 repos |

5. Profiling Results

All originally-planned hotspots have been profiled and fixed:

| Repo | Scenario | Before | After | Issue |
|------|----------|--------|-------|-------|
| realizar | Load 17 GB Q4K APR | 19.6 GB peak host RAM | 1.3 GB | ALB-099 #1 |
| realizar | Format detection | Read entire file | 8 bytes | ALB-099 #14-15 |
| realizar | KV cache (GQA) | 4-8x over-allocated | Correct kv_dim | ALB-102 |
| realizar | Top-k sampling | 2.4 MB/token | In-place, O(n) | ALB-103 |
| aprender | APR validation | 8.2 GB (dequant all) | Metadata-only | ALB-099 #3 |
| aprender | APR read | Re-parse per tensor | Cached offset | ALB-104 |
| aprender | APR write | Buffer in RAM | Streaming writer | ALB-105 |
| aprender | Tied embeddings | Clone 1.4 GB BTreeMap | Cow (lazy clone) | ALB-099 #10 |
| entrenar | Dataset loading | 37 GB eager | Streaming per-shard | ALB-101 |
| entrenar | LMBatch storage | 2x tokens (input+target) | Shared stride | ALB-100 |
| entrenar | Tracer memory | O(steps) unbounded | O(1) aggregated | ALB-099 #4 |
| entrenar | Checkpoint save | 900 MB redundant copies | Consuming pipeline | ALB-099 #9 |
| trueno | BLIS workspace | 4.3 MB/call | Thread-local reuse | ALB-099 #7 |
| trueno | TunerFeatures | 140K Vec allocs | Stack array | ALB-099 #6 |
| alimentar | cmd_info | 73 MB | 0.2 MB | ALB-099 #5 |
| alimentar | Unique::apply | O(rows×cols) Strings | u64 hash keys | ALB-099 #8 |
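The "Cow (lazy clone)" entry for tied embeddings follows a pattern worth sketching: borrow the tensor map on the common path, and clone only when a rewrite is actually required. The function name, map contents, and tie condition below are hypothetical, not aprender's actual resolve_f32_tied_embeddings code:

```rust
use std::borrow::Cow;
use std::collections::BTreeMap;

// Sketch of the lazy-clone pattern behind ALB-099 #10 (hypothetical names):
// return a borrow when no mutation is needed; clone the map only when tying
// the output head to the input embeddings requires inserting a new entry.
fn resolve_tied<'a>(
    tensors: &'a BTreeMap<String, Vec<f32>>,
    tie_output_to_input: bool,
) -> Cow<'a, BTreeMap<String, Vec<f32>>> {
    if !tie_output_to_input {
        return Cow::Borrowed(tensors); // zero-copy fast path
    }
    let mut owned = tensors.clone(); // clone only when we must mutate
    if let Some(emb) = tensors.get("embed") {
        owned.insert("output".to_string(), emb.clone());
    }
    Cow::Owned(owned)
}

fn main() {
    let mut tensors = BTreeMap::new();
    tensors.insert("embed".to_string(), vec![0.1, 0.2]);

    // Untied: no clone of the (potentially multi-GB) map happens.
    let borrowed = resolve_tied(&tensors, false);
    assert!(matches!(borrowed, Cow::Borrowed(_)));

    // Tied: the clone is paid only on this path.
    let owned = resolve_tied(&tensors, true);
    assert_eq!(owned.get("output"), Some(&vec![0.1, 0.2]));
}
```

Callers that only read the result never observe the difference; the 1.4 GB clone cost is confined to the configurations that actually tie weights.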

Contract

contracts/memory-profiling-v1.yaml (ALB-099) — 4 obligations, all verified.