20. Memory Profiling with dhat-rs
Status: COMPLETE (ALB-099 FIXED). dhat-rs integrated in all 5 repos. 21 memory issues found and fixed across 4 profiling rounds.
0. Motivation & Results
The sovereign stack originally had zero heap profiling instrumentation. dhat-rs profiling (ALB-099) uncovered 21 issues across 5 repos:
Round 1 (binary workloads):
1. realizar: std::fs::read on 17 GB model → peak 19.6 GB. Fixed: early-out for GPU Q4K → 1.3 GB (93% reduction)
2. realizar: PMAT-045 variable scoping bug (compile error)
3. aprender: validate_gguf dequantized every tensor to f32 → 8.2 GB. Fixed: metadata-only → 6.0 GB
4. entrenar: TRACER accumulated unbounded Vec → ~2.8 GB at 28K steps. Fixed: aggregated HashMap → O(1) memory
5. alimentar: cmd_info loaded entire dataset → 73 MB. Fixed: Parquet footer → 0.2 MB (348x)
6. trueno: TunerFeatures::to_vector() → 140K tiny Vec allocs. Fixed: to_array() → zero alloc
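The TRACER fix (item 4) is representative of the unbounded-accumulation class of issue. A minimal sketch, with hypothetical type and field names (entrenar's actual tracer API is not reproduced here): replace the per-step Vec of events with running per-metric aggregates, so memory is O(number of metrics) rather than O(steps).

```rust
use std::collections::HashMap;

/// Sketch of the item-4 fix: instead of pushing one (metric, value)
/// entry per training step (O(steps) memory, ~2.8 GB by step 28K),
/// keep a running (count, sum) per metric name.
#[derive(Default)]
struct AggregatedTracer {
    stats: HashMap<&'static str, (u64, f64)>, // metric -> (count, sum)
}

impl AggregatedTracer {
    fn record(&mut self, metric: &'static str, value: f64) {
        let entry = self.stats.entry(metric).or_insert((0, 0.0));
        entry.0 += 1;
        entry.1 += value;
    }

    /// Mean of all recorded values for a metric, if any were recorded.
    fn mean(&self, metric: &str) -> Option<f64> {
        self.stats.get(metric).map(|(n, sum)| *sum / *n as f64)
    }
}
```

The trade-off is losing the per-step time series; any statistic you still need (mean, min/max, last value) must be folded into the aggregate as it streams by.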
Round 2 (deeper workloads):
7. trueno: BLIS gemm_blis() allocated 4.3 MB workspace per call. Fixed: thread-local high-water-mark → zero alloc after first call
8. alimentar: Unique::apply() O(rows×cols) String allocs. Fixed: u64 hash keys → zero String allocs
9. aprender+entrenar: APR checkpoint save copied 450 MB weights twice. Fixed: consuming pipeline → 900 MB eliminated
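Item 7's thread-local high-water-mark pattern generalizes to any per-call scratch buffer. A hedged sketch under assumed names (trueno's actual gemm_blis signature is not shown here): the buffer grows to the largest size ever requested on that thread and is never shrunk, so steady-state calls allocate nothing.

```rust
use std::cell::RefCell;

thread_local! {
    // Scratch buffer reused across calls on this thread. It grows to
    // the largest size ever requested (the high-water mark), so after
    // warm-up every call performs zero heap allocations.
    static WORKSPACE: RefCell<Vec<f32>> = RefCell::new(Vec::new());
}

/// Runs `f` with a mutable slice of the thread-local workspace.
/// Note: the slice is NOT zeroed between calls.
fn with_workspace<R>(len: usize, f: impl FnOnce(&mut [f32]) -> R) -> R {
    WORKSPACE.with(|ws| {
        let mut buf = ws.borrow_mut();
        if buf.len() < len {
            buf.resize(len, 0.0); // allocates only past the high-water mark
        }
        f(&mut buf[..len])
    })
}
```

The closure-based API keeps the borrow scoped, which avoids leaking a `RefCell` guard across calls; re-entrant use inside `f` would panic on the nested `borrow_mut`, so callers must not recurse into it.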
Round 3 (code analysis):
10. aprender: resolve_f32_tied_embeddings cloned BTreeMap (~1.4 GB). Fixed: Cow → only clones when needed
11. aprender: AprV2Writer output Vec no capacity hint. Fixed: with_capacity(total_size)
12. entrenar: ArrowDataset triple materialization (~1 GB/shard). Fixed: scoped drop
13. entrenar: 3 Vec allocations without with_capacity in hot path. Fixed: capacity hints
14. realizar: detect_model() read entire file for 8-byte magic. Fixed: File::open + read(8) → ~0 bytes
15. realizar: run_inference() read entire file for format detection. Fixed: read_exact(8) → ~0 bytes
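Items 14-15 share one fix shape: read only the bytes format detection needs. A minimal sketch (function name and signature are illustrative, not realizar's API):

```rust
use std::fs::File;
use std::io::Read;
use std::path::Path;

/// Reads only the 8-byte magic for format detection instead of
/// slurping the entire (possibly multi-GB) model file into memory.
fn read_magic(path: &Path) -> std::io::Result<[u8; 8]> {
    let mut magic = [0u8; 8];
    File::open(path)?.read_exact(&mut magic)?;
    Ok(magic)
}
```

`read_exact` also turns a truncated file into a clean `UnexpectedEof` error rather than a silent short read.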
Round 4 (architectural issues — ALB-100 through ALB-105):
16. entrenar: LMBatch stored input+target separately → 50% waste. Fixed: shared stride-based tokens (ALB-100)
17. entrenar: Eager dataset loading → 37 GB for 19 shards. Fixed: StreamingParquetLoader (ALB-101)
18. realizar: KV cache used hidden_dim not kv_dim → 4-8x over-allocation. Fixed: GQA-aware sizing (ALB-102)
19. realizar: sample_topk allocated 2.4 MB per token. Fixed: in-place masking + O(n) partial sort (ALB-103)
20. aprender: APR reader re-parsed header per tensor read. Fixed: cached data_offset (ALB-104)
21. aprender: APR write_into() buffered entire file in RAM. Fixed: AprV2StreamingWriter (ALB-105)
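The KV-cache fix (item 18, ALB-102) is pure sizing arithmetic. With grouped-query attention the cache holds n_kv_heads × head_dim values per token per layer, not the full hidden_dim. A sketch with illustrative numbers (not realizar's actual configuration):

```rust
/// Bytes for a KV cache holding `dim` f32 values per token per layer,
/// for both the K and the V tensor (hence the factor of 2).
fn kv_cache_bytes(layers: usize, seq_len: usize, dim: usize) -> usize {
    2 * layers * seq_len * dim * std::mem::size_of::<f32>()
}
```

Sizing with hidden_dim instead of kv_dim over-allocates by exactly the query-to-KV head ratio, which is where the reported 4-8x comes from.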
1. Tool Selection: dhat-rs
dhat-rs (crate: dhat, version 0.3) is a Rust heap profiler that:
- Tracks every allocation (count, size, lifetime)
- Reports peak heap usage, total bytes allocated, allocation hot spots
- Produces DHAT-compatible JSON for visualization in dh_view.html
- Has ~2× runtime overhead (acceptable for profiling builds, not production)
- Zero overhead when disabled (feature-gated, compiles to nothing)
Why not alternatives:
| Tool | Rejection Reason |
|---|---|
| Valgrind DHAT | External tool, not composable with cargo test |
| jemalloc + MALLOC_CONF | Non-portable, Linux-only, coarse-grained |
| peak_alloc | Only tracks peak RSS, no per-allocation detail |
| tikv-jemallocator + stats | Heavy dependency, overkill for profiling |
2. Integration Architecture
2.1 Feature Flag Convention
Every repo uses the same feature flag: dhat-heap.
```toml
[features]
dhat-heap = ["dep:dhat"]

[dependencies]
dhat = { version = "0.3", optional = true }
```
The flag name dhat-heap follows dhat-rs convention and is already recognized
by the crate’s #[cfg(feature = "dhat-heap")] guards.
2.2 Global Allocator Pattern
In each binary crate’s main.rs (or lib.rs for libraries with integration
tests):
```rust
#[cfg(feature = "dhat-heap")]
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;
```
The profiler is activated at program start:
```rust
fn main() {
    #[cfg(feature = "dhat-heap")]
    let _profiler = dhat::Profiler::new_heap();
    // ... existing main logic ...
}
```
When the _profiler drops (at program exit), dhat-rs writes:
- Summary to stderr (peak heap, total bytes, allocation count)
- dhat-heap.json to CWD (viewable in dh_view.html)
2.3 Per-Repo Integration Points
| Repo | Crate | Entry Point | Purpose |
|---|---|---|---|
| realizar | realizar (bin) | src/main.rs | Inference memory: model load, KV cache, serve |
| aprender | apr-cli (bin) | crates/apr-cli/src/main.rs | CLI memory: import, quantize, eval |
| trueno | trueno (lib) | src/lib.rs + integration tests | Tensor op allocations |
| entrenar | entrenar (bin) | src/main.rs | Training memory: buffers, optimizer, checkpoints |
2.4 Integration Test Pattern
For library crates (trueno), use dhat’s assertion API in tests:
```rust
#[cfg(test)]
#[cfg(feature = "dhat-heap")]
mod dhat_tests {
    use super::*;

    #[test]
    fn test_vector_add_allocations() {
        let _profiler = dhat::Profiler::builder().testing().build();
        // ... exercise code ...
        let stats = dhat::HeapStats::get();
        // Assert allocation bounds
        dhat::assert!(stats.total_bytes < 1_000_000);
    }
}
```
3. Usage
3.1 Profile a Binary
```bash
# Profile realizar inference
cd ~/src/realizar
CARGO_TARGET_DIR=/mnt/nvme-raid0/targets/realizar \
  cargo run --release --features dhat-heap -- \
  run /mnt/nvme-raid0/models/qwen3-coder-30b-q4k.apr "def fib(n):" --gpu --max-tokens 20

# Profile entrenar training
cd ~/src/entrenar
CARGO_TARGET_DIR=/mnt/nvme-raid0/targets/entrenar \
  cargo run --release --features "dhat-heap,cuda,parquet" -- \
  train --config configs/pretrain-50m-quick.yaml

# Profile apr CLI
cd ~/src/aprender
CARGO_TARGET_DIR=/mnt/nvme-raid0/targets/aprender \
  cargo run --release --features dhat-heap -p apr-cli -- \
  tensors info /mnt/nvme-raid0/models/qwen3-coder-30b-q4k.apr
```
3.2 View Results
On exit, dhat-rs writes dhat-heap.json to the current working directory. Open https://nnethercote.github.io/dh_view/dh_view.html in a browser and load dhat-heap.json to get a tree view of allocation sites.
3.3 Run dhat-Gated Tests
```bash
# Run tests with allocation assertions
cargo test --features dhat-heap
```
4. Contracts
Contract: contracts/memory-profiling-v1.yaml (ALB-099)
4.1 Obligations
| ID | Description | Falsification |
|---|---|---|
| C-DHAT-001 | All 4 repos compile with --features dhat-heap | cargo check --features dhat-heap exits 0 |
| C-DHAT-002 | Binary crates produce dhat-heap.json on exit | Run binary, assert file exists |
| C-DHAT-003 | Zero overhead when feature disabled | cargo build --release size unchanged ±1% |
| C-DHAT-004 | Feature flag is dhat-heap in all repos | grep 'dhat-heap' Cargo.toml in all 4 repos |
5. Profiling Results
All originally-planned hotspots have been profiled and fixed:
| Repo | Scenario | Before | After | Issue |
|---|---|---|---|---|
| realizar | Load 17 GB Q4K APR | 19.6 GB peak host RAM | 1.3 GB | ALB-099 #1 |
| realizar | Format detection | Read entire file | 8 bytes | ALB-099 #14-15 |
| realizar | KV cache (GQA) | 4-8x over-allocated | Correct kv_dim | ALB-102 |
| realizar | Top-k sampling | 2.4 MB/token | In-place, O(n) | ALB-103 |
| aprender | APR validation | 8.2 GB (dequant all) | Metadata-only | ALB-099 #3 |
| aprender | APR read | Re-parse per tensor | Cached offset | ALB-104 |
| aprender | APR write | Buffer in RAM | Streaming writer | ALB-105 |
| aprender | Tied embeddings | Clone 1.4 GB BTreeMap | Cow (lazy clone) | ALB-099 #10 |
| entrenar | Dataset loading | 37 GB eager | Streaming per-shard | ALB-101 |
| entrenar | LMBatch storage | 2x tokens (input+target) | Shared stride | ALB-100 |
| entrenar | Tracer memory | O(steps) unbounded | O(1) aggregated | ALB-099 #4 |
| entrenar | Checkpoint save | 900 MB redundant copies | Consuming pipeline | ALB-099 #9 |
| trueno | BLIS workspace | 4.3 MB/call | Thread-local reuse | ALB-099 #7 |
| trueno | TunerFeatures | 140K Vec allocs | Stack array | ALB-099 #6 |
| alimentar | cmd_info | 73 MB | 0.2 MB | ALB-099 #5 |
| alimentar | Unique::apply | O(rows×cols) Strings | u64 hash keys | ALB-099 #8 |
Contract: contracts/memory-profiling-v1.yaml (ALB-099) — 4 obligations, all verified.