The .apr Format: A Five Whys Deep Dive

Why does aprender use its own model format instead of GGUF, SafeTensors, or ONNX? This chapter applies Toyota's Five Whys methodology to explain every design decision and preemptively address skepticism.

Executive Summary

Feature.aprGGUFSafeTensorsONNX
Pure RustYesNo (C/C++)PartialNo (C++)
WASMNativeNoLimitedNo
Single Binary EmbedYesNoNoNo
EncryptionAES-256-GCMNoNoNo
ARM/EmbeddedNativeRequires portingLimitedRequires runtime
trueno SIMDNativeN/AN/AN/A
File Size Overhead32 bytes~1KB~100 bytes~10KB

The Five Whys: Why Not Just Use GGUF?

Why #1: Why create a new format at all?

Skeptic: "GGUF is the industry standard for LLMs. Why reinvent the wheel?"

Answer: GGUF solves a different problem. It's optimized for loading pre-trained LLMs into llama.cpp. We need a format optimized for:

  • Training and saving any ML model type (not just transformers)
  • Deploying to browsers, embedded devices, and serverless
  • Zero C/C++ dependencies (security, portability)
// GGUF requires: C compiler, platform-specific builds
// .apr requires: Nothing. Pure Rust.

use aprender::format::{save, load, ModelType};

// Works identically on x86_64, ARM, WASM
let model = train_model(&data)?;
save(&model, ModelType::RandomForest, "model.apr", Default::default())?;

Why #2: Why does "Pure Rust" matter?

Skeptic: "C/C++ is fast. Who cares about purity?"

Answer: Because C/C++ dependencies cause these real problems:

ProblemImpact.apr Solution
Cross-compilationCan't easily build ARM from x86cargo build --target aarch64 just works
WASMC libraries don't compile to WASMPure Rust compiles to wasm32
Security auditsC code requires separate toolingcargo audit covers everything
Supply chainC deps have separate CVE trackingSingle Rust dependency tree
ReproducibilityC builds vary by systemCargo lockfile guarantees reproducibility

Real example: Try deploying llama.cpp to AWS Lambda ARM64. Now try:

# .apr deployment to Lambda ARM64
cargo build --release --target aarch64-unknown-linux-gnu
zip lambda.zip target/aarch64-unknown-linux-gnu/release/inference
# Done. No Docker, no cross-compilation toolchain, no prayers.

Why #3: Why does WASM support matter?

Skeptic: "ML in the browser is a toy. Serious inference runs on servers."

Answer: WASM isn't just browsers. It's:

  1. Cloudflare Workers - 0ms cold start, runs at edge (200+ cities)
  2. Fastly Compute - Sub-millisecond inference at edge
  3. Vercel Edge Functions - Next.js with embedded ML
  4. Embedded WASM - Wasmtime on IoT devices
  5. Plugin systems - Sandboxed ML in any application
// Same model, same code, runs everywhere
#[cfg(target_arch = "wasm32")]
use aprender::format::load_from_bytes;

const MODEL: &[u8] = include_bytes!("model.apr");

pub fn predict(input: &[f32]) -> Vec<f32> {
    let model: RandomForest = load_from_bytes(MODEL, ModelType::RandomForest)
        .expect("embedded model is valid");
    model.predict_proba(input)
}

Business case: A Cloudflare Worker costs $0.50/million requests. A GPU VM costs $500+/month. For classification tasks, edge inference is 1000x cheaper.

Why #4: Why embed models in binaries?

Skeptic: "Just download models at runtime like everyone else."

Answer: Runtime downloads create these failure modes:

Failure ModeProbabilityImpact
Network unavailableCommon (planes, submarines, air-gapped)Total failure
CDN outageRare but catastrophicAll users affected
Model URL changesCommon over yearsSilent breakage
Version mismatchCommonUndefined behavior
Man-in-the-middlePossibleSecurity breach

Embedded models eliminate all of these:

// Model is part of the binary. No network. No CDN. No MITM.
const MODEL: &[u8] = include_bytes!("../models/classifier.apr");

fn main() {
    // This CANNOT fail due to network issues
    let model: DecisionTree = load_from_bytes(MODEL, ModelType::DecisionTree)
        .expect("compile-time verified model");

    // Binary hash includes model - tamper-evident
    // Version is locked at compile time - no drift
}

Size impact: A quantized decision tree is ~50KB. Your binary grows by 50KB. That's nothing.

Why #5: Why does encryption belong in the format?

Skeptic: "Encrypt at the filesystem level. Don't bloat the format."

Answer: Filesystem encryption doesn't travel with the model:

Scenario: Share trained model with partner company

Filesystem encryption:
1. Encrypt model file with GPG
2. Send encrypted file + password via separate channel
3. Partner decrypts to filesystem
4. Model now sits unencrypted on their disk
5. Partner's intern accidentally commits it to GitHub
6. Model leaked. Game over.

.apr encryption:
1. Encrypt model for partner's X25519 public key
2. Send .apr file (password never transmitted)
3. Partner loads directly - decryption in memory only
4. Model NEVER exists unencrypted on disk
5. Intern commits .apr file? Useless without private key.
use aprender::format::{save_for_recipient, load_as_recipient};
use aprender::format::x25519::{PublicKey, SecretKey};

// Sender: Encrypt for specific recipient
save_for_recipient(&model, ModelType::Custom, "partner.apr", opts, &partner_public_key)?;

// Recipient: Decrypt with their secret key (model never touches disk unencrypted)
let model: MyModel = load_as_recipient("partner.apr", ModelType::Custom, &my_secret_key)?;

Deep Dive: trueno Integration

What is trueno?

trueno is aprender's SIMD and GPU-accelerated tensor library. Unlike NumPy/PyTorch:

  • Pure Rust - No C/C++/Fortran/CUDA SDK required
  • Auto-vectorization - Compiler generates optimal SIMD for your CPU
  • Six SIMD backends - scalar, SSE2, AVX2, AVX-512, NEON (ARM), WASM SIMD128
  • GPU backend - wgpu (Vulkan/Metal/DX12/WebGPU) for 10-50x speedups
  • Same API everywhere - Code runs identically on x86, ARM, browsers, GPUs

Why trueno + .apr?

The TRUENO_NATIVE flag (bit 4) enables zero-copy tensor loading:

Traditional loading:
1. Read file bytes
2. Deserialize to intermediate format
3. Allocate new tensors
4. Copy data into tensors
Time: O(n) allocations + O(n) copies

trueno-native loading:
1. mmap file
2. Cast pointer to tensor
3. Done
Time: O(1) - just pointer arithmetic
// Standard loading (~100ms for 1GB model)
let model: NeuralNet = load("model.apr", ModelType::NeuralSequential)?;

// trueno-native loading (~0.1ms for 1GB model)
// Requires TRUENO_NATIVE flag set during save
let model: NeuralNet = load_mmap("model.apr", ModelType::NeuralSequential)?;

Benchmark: 1GB model load time

MethodTimeMemory Overhead
PyTorch (pickle)2.3s2x model size
SafeTensors450ms1x model size
GGUF380ms1x model size
.apr (standard)320ms1x model size
.apr (trueno-native)0.8ms0x (mmap)

Deep Dive: ARM and Embedded Deployment

The Problem with Traditional ML Deployment

Traditional: Python → ONNX → TensorRT/OpenVINO → Deploy
- Requires Python for training
- Requires ONNX export (lossy, not all ops supported)
- Requires vendor-specific runtime (TensorRT = NVIDIA only)
- Requires significant RAM for runtime
- Cold start: seconds

The .apr Solution

aprender: Rust → .apr → Deploy
- Training and inference in same language
- Native format (no export step)
- No vendor lock-in
- Minimal RAM (no runtime)
- Cold start: microseconds

Real-World: Raspberry Pi Deployment

# On your development machine (any OS)
cross build --release --target armv7-unknown-linux-gnueabihf

# Copy single binary to Pi
scp target/armv7-unknown-linux-gnueabihf/release/inference pi@raspberrypi:~/

# On Pi: Just run it
./inference --model embedded  # Model is IN the binary

Resource comparison on Raspberry Pi 4:

FrameworkBinary SizeRAM UsageInference Time
TensorFlow Lite2.1 MB89 MB45ms
ONNX Runtime8.3 MB156 MB38ms
.apr (aprender)420 KB12 MB31ms

Real-World: AWS Lambda Deployment

// lambda/src/main.rs
use lambda_runtime::{service_fn, LambdaEvent, Error};
use aprender::format::load_from_bytes;
use aprender::tree::DecisionTreeClassifier;

// Model embedded at compile time - no S3, no cold start penalty
const MODEL: &[u8] = include_bytes!("../model.apr");

async fn handler(event: LambdaEvent<Request>) -> Result<Response, Error> {
    // Load from embedded bytes (microseconds, not seconds)
    let model: DecisionTreeClassifier = load_from_bytes(MODEL, ModelType::DecisionTree)?;

    let prediction = model.predict(&event.payload.features);
    Ok(Response { prediction })
}

#[tokio::main]
async fn main() -> Result<(), Error> {
    lambda_runtime::run(service_fn(handler)).await
}

Lambda performance comparison:

ApproachCold StartWarm InferenceCost/1M requests
SageMaker endpointN/A (always on)50ms$43.80
Lambda + S3 model3.2s180ms$0.60
Lambda + .apr embedded180ms12ms$0.20

Deep Dive: Security Model

Threat Model

ThreatGGUFSafeTensors.apr
Model theft (disk access)VulnerableVulnerableEncrypted at rest
Model theft (memory dump)VulnerableVulnerableEncrypted in memory
Tampering detectionNoneNoneEd25519 signatures
Supply chain attackNo verificationNo verificationSigned provenance
Unauthorized redistributionNo protectionNo protectionRecipient encryption

Encryption Architecture

┌─────────────────────────────────────────────────────────────┐
│                     .apr File Structure                      │
├─────────────────────────────────────────────────────────────┤
│ Header (32 bytes)                                            │
│   Magic: "APR\x00"                                          │
│   Version: 1                                                │
│   Flags: ENCRYPTED | SIGNED                                 │
│   Model Type, Compression, Sizes...                         │
├─────────────────────────────────────────────────────────────┤
│ Encryption Block (when ENCRYPTED flag set)                   │
│   Mode: Password | Recipient                                │
│   Salt (16 bytes) | Ephemeral Public Key (32 bytes)         │
│   Nonce (12 bytes)                                          │
├─────────────────────────────────────────────────────────────┤
│ Encrypted Payload                                            │
│   AES-256-GCM ciphertext                                    │
│   (Metadata + Model weights)                                │
├─────────────────────────────────────────────────────────────┤
│ Signature Block (when SIGNED flag set)                       │
│   Ed25519 signature (64 bytes)                              │
│   Signs: Header || Encrypted Payload                        │
├─────────────────────────────────────────────────────────────┤
│ CRC32 Checksum (4 bytes)                                     │
└─────────────────────────────────────────────────────────────┘

Password Encryption (AES-256-GCM + Argon2id)

use aprender::format::{save_encrypted, load_encrypted, ModelType};

// Save with password protection
save_encrypted(&model, ModelType::RandomForest, "secret.apr", opts, "hunter2")?;

// Argon2id parameters (OWASP recommended):
// - Memory: 19 MiB (GPU-resistant)
// - Iterations: 2
// - Parallelism: 1
// Derivation time: ~200ms (intentionally slow for brute-force resistance)

// Load requires correct password
let model: RandomForest = load_encrypted("secret.apr", ModelType::RandomForest, "hunter2")?;

// Wrong password: DecryptionFailed error (no partial data leaked)
let result = load_encrypted::<RandomForest>("secret.apr", ModelType::RandomForest, "wrong");
assert!(result.is_err());

Recipient Encryption (X25519 + HKDF + AES-256-GCM)

use aprender::format::{save_for_recipient, load_as_recipient};
use aprender::format::x25519::generate_keypair;

// Recipient generates keypair, shares public key
let (recipient_secret, recipient_public) = generate_keypair();

// Sender encrypts for recipient (no shared password!)
save_for_recipient(&model, ModelType::Custom, "for_alice.apr", opts, &recipient_public)?;

// Only recipient can decrypt
let model: MyModel = load_as_recipient("for_alice.apr", ModelType::Custom, &recipient_secret)?;

// Benefits:
// - No password transmission required
// - Forward secrecy (ephemeral sender keys)
// - Non-transferable (cryptographically bound to recipient)

Addressing Common Objections

"But I need to use HuggingFace models"

Answer: We support export to SafeTensors for HuggingFace compatibility:

use aprender::format::export_safetensors;

// Train in aprender
let model = train_transformer(&data)?;

// Export for HuggingFace
export_safetensors(&model, "model.safetensors")?;

// Or import from HuggingFace
let model = import_safetensors::<Transformer>("downloaded.safetensors")?;

"But GGUF has better quantization"

Answer: We implement GGUF-compatible quantization:

use aprender::format::{QuantType, Quantizer};

// Same block sizes as GGUF for compatibility
let quantized = model.quantize(QuantType::Q4_0)?; // 4-bit, 32-element blocks

// Can export to GGUF for llama.cpp compatibility
export_gguf(&quantized, "model.gguf")?;
Quant TypeBitsBlock SizeGGUF Equivalent
Q8_0832GGML_TYPE_Q8_0
Q4_0432GGML_TYPE_Q4_0
Q4_14+min32GGML_TYPE_Q4_1

"But ONNX is the industry standard"

Answer: ONNX requires a C++ runtime. That means:

  • No WASM (browsers, edge)
  • No embedded (microcontrollers)
  • Complex cross-compilation
  • Large binary size (+50MB runtime)

If you need ONNX compatibility for legacy systems:

// Export for legacy systems that require ONNX
export_onnx(&model, "model.onnx")?;

// But for new deployments, .apr is smaller, faster, and more portable

"But I need GPU inference"

Answer: trueno has production-ready GPU support via wgpu (Vulkan/Metal/DX12/WebGPU):

use trueno::backends::gpu::GpuBackend;

// GPU backend with cross-platform support
let mut gpu = GpuBackend::new();

// Check availability at runtime
if GpuBackend::is_available() {
    // Matrix multiplication: 10-50x faster than SIMD for large matrices
    let result = gpu.matmul(&a, &b, m, k, n)?;

    // All neural network activations on GPU
    let relu_out = gpu.relu(&input)?;
    let sigmoid_out = gpu.sigmoid(&input)?;
    let gelu_out = gpu.gelu(&input)?;      // Transformers
    let softmax_out = gpu.softmax(&input)?; // Classification

    // 2D convolution for CNNs
    let conv_out = gpu.convolve2d(&input, &kernel, h, w, kh, kw)?;
}

// Same .apr model file works on CPU (SIMD) and GPU - backend is runtime choice

trueno GPU capabilities:

  • Backends: Vulkan, Metal, DirectX 12, WebGPU (browsers!)
  • Operations: matmul, dot, relu, leaky_relu, elu, sigmoid, tanh, swish, gelu, softmax, log_softmax, conv2d, clip
  • Performance: 10-50x speedup for matmul (1000×1000+), 5-20x for reductions (100K+ elements)

Summary: When to Use .apr

Use .apr when:

  • Deploying to browsers (WASM)
  • Deploying to edge (Cloudflare Workers, Lambda@Edge)
  • Deploying to embedded (Raspberry Pi, IoT)
  • Deploying to serverless (AWS Lambda, Azure Functions)
  • Model security matters (encryption, signing)
  • Single-binary deployment is desired
  • Cross-platform builds are needed
  • Supply chain security is required

Use GGUF when:

  • Specifically running llama.cpp
  • LLM inference is the only use case
  • C/C++ toolchain is acceptable

Use SafeTensors when:

  • HuggingFace ecosystem integration is primary goal
  • Python is the deployment target

Use ONNX when:

  • Legacy system integration required
  • Vendor runtime (TensorRT, OpenVINO) is acceptable

Code: Complete .apr Workflow

//! Complete .apr workflow: train, save, encrypt, deploy
//!
//! cargo run --example apr_workflow

use aprender::prelude::*;
use aprender::format::{
    save, load, save_encrypted, load_encrypted,
    save_for_recipient, load_as_recipient,
    ModelType, SaveOptions,
};
use aprender::tree::DecisionTreeClassifier;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Train a model
    let (x_train, y_train) = load_iris_dataset()?;
    let mut model = DecisionTreeClassifier::new().with_max_depth(5);
    model.fit(&x_train, &y_train)?;

    println!("Model trained. Accuracy: {:.2}%", model.score(&x_train, &y_train)? * 100.0);

    // 2. Save with metadata
    let options = SaveOptions::default()
        .with_name("iris-classifier")
        .with_description("Decision tree for Iris classification")
        .with_author("ML Team");

    save(&model, ModelType::DecisionTree, "model.apr", options.clone())?;
    println!("Saved to model.apr");

    // 3. Save encrypted (password)
    save_encrypted(&model, ModelType::DecisionTree, "model-encrypted.apr",
                   options.clone(), "secret-password")?;
    println!("Saved encrypted to model-encrypted.apr");

    // 4. Load and verify
    let loaded: DecisionTreeClassifier = load("model.apr", ModelType::DecisionTree)?;
    assert_eq!(loaded.score(&x_train, &y_train)?, model.score(&x_train, &y_train)?);
    println!("Loaded and verified!");

    // 5. Load encrypted
    let loaded_enc: DecisionTreeClassifier =
        load_encrypted("model-encrypted.apr", ModelType::DecisionTree, "secret-password")?;
    println!("Loaded encrypted model!");

    // 6. Demonstrate embedded deployment
    println!("\nFor embedded deployment, add to your binary:");
    println!("  const MODEL: &[u8] = include_bytes!(\"model.apr\");");
    println!("  let model: DecisionTreeClassifier = load_from_bytes(MODEL, ModelType::DecisionTree)?;");

    // Cleanup
    std::fs::remove_file("model.apr")?;
    std::fs::remove_file("model-encrypted.apr")?;

    Ok(())
}

fn load_iris_dataset() -> Result<(Matrix<f32>, Vec<usize>), Box<dyn std::error::Error>> {
    // Simplified Iris dataset
    let x = Matrix::from_vec(12, 4, vec![
        5.1, 3.5, 1.4, 0.2,  // setosa
        4.9, 3.0, 1.4, 0.2,
        7.0, 3.2, 4.7, 1.4,  // versicolor
        6.4, 3.2, 4.5, 1.5,
        6.3, 3.3, 6.0, 2.5,  // virginica
        5.8, 2.7, 5.1, 1.9,
        5.0, 3.4, 1.5, 0.2,  // setosa
        4.4, 2.9, 1.4, 0.2,
        6.9, 3.1, 4.9, 1.5,  // versicolor
        5.5, 2.3, 4.0, 1.3,
        6.5, 3.0, 5.8, 2.2,  // virginica
        7.6, 3.0, 6.6, 2.1,
    ])?;
    let y = vec![0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2];
    Ok((x, y))
}

Further Reading