Case Study: APR with JSON Metadata

This case study demonstrates embedding arbitrary JSON metadata (vocabulary, tokenizer config, model settings) alongside tensor data in a single .apr file for WASM-ready deployment.

The Problem

Modern ML models need more than just weights:

| Data Type  | Traditional Approach       | Problem                  |
|------------|----------------------------|--------------------------|
| Vocabulary | Separate `vocab.json`      | Multiple files to manage |
| Tokenizer  | Separate `tokenizer.json`  | Version mismatches       |
| Config     | Separate `config.yaml`     | Deployment complexity    |
| Custom     | Application-specific files | N+1 file problem         |

The Solution: Embedded Metadata

The .apr format supports arbitrary JSON metadata embedded directly in the model file:

```rust
use aprender::serialization::apr::{AprWriter, AprReader};
use serde_json::json;

let mut writer = AprWriter::new();

// Embed any JSON metadata
writer.set_metadata("model_type", json!("whisper-tiny"));
writer.set_metadata("n_vocab", json!(51865));
writer.set_metadata("tokenizer", json!({
    "tokens": ["<|endoftext|>", "<|startoftranscript|>"],
    "merges": [["t", "h"], ["th", "e"]],
    "special_tokens": {"eot": 50256, "sot": 50257}
}));

// Add tensors (`weights` is a flat f32 buffer of length 384 * 80)
writer.add_tensor_f32("encoder.weight", vec![384, 80], &weights);

// Single file contains everything
writer.write("model.apr")?;
```

Complete Example

Run: `cargo run --example apr_with_metadata`

See the CLI reference and the source code in crates/ for implementation details.

Key Features

1. Arbitrary JSON Metadata

Any JSON-serializable data can be embedded:

```rust
// Strings
writer.set_metadata("model_name", json!("my-model"));

// Numbers
writer.set_metadata("n_layers", json!(12));

// Arrays
writer.set_metadata("supported_languages", json!(["en", "es", "fr"]));

// Objects
writer.set_metadata("config", json!({
    "hidden_size": 768,
    "num_attention_heads": 12
}));
```

2. Type-Safe Tensor Storage

Tensors are stored with shape information:

```rust
writer.add_tensor_f32("layer.0.weight", vec![768, 768], &weights);
writer.add_tensor_f32("layer.0.bias", vec![768], &bias);
```
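Consumers can enforce the same shape invariant when loading. A minimal std-only sketch; this `Tensor` struct is illustrative, not aprender's actual type:

```rust
/// Illustrative tensor container: the shape must account for every element.
#[derive(Debug)]
pub struct Tensor {
    pub shape: Vec<usize>,
    pub data: Vec<f32>,
}

impl Tensor {
    /// Rejects data whose length does not match the product of the shape.
    pub fn new(shape: Vec<usize>, data: Vec<f32>) -> Result<Self, String> {
        let expected: usize = shape.iter().product();
        if expected != data.len() {
            return Err(format!(
                "shape {:?} implies {} elements, got {}",
                shape,
                expected,
                data.len()
            ));
        }
        Ok(Tensor { shape, data })
    }
}

fn main() {
    assert!(Tensor::new(vec![2, 3], vec![0.0; 6]).is_ok());
    // A mismatched buffer is caught at load time instead of during inference.
    assert!(Tensor::new(vec![768, 768], vec![0.0; 100]).is_err());
    println!("shape checks passed");
}
```

Catching a shape/buffer mismatch at load time turns a silent garbage-output bug into an immediate, descriptive error.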

3. Single-File Deployment

Perfect for WASM:

```rust
// Embed at compile time
const MODEL: &[u8] = include_bytes!("model.apr");

fn inference(input: &[f32]) -> Vec<f32> {
    let reader = AprReader::from_bytes(MODEL.to_vec()).unwrap();

    // Access metadata
    let vocab = reader.get_metadata("tokenizer").unwrap();

    // Access tensors
    let weights = reader.read_tensor_f32("encoder.weight").unwrap();

    // ... inference logic producing the output vector
    todo!()
}
```
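The single-file idea itself can be sketched without the aprender API. The layout below is a hypothetical demo format, not the actual .apr specification; it only illustrates why a length-prefixed container lets metadata and tensors be separated from one embedded byte slice without copying:

```rust
// Hypothetical single-file layout (NOT the actual .apr spec):
// [4-byte magic "DEMO"][4-byte LE metadata length][metadata JSON][tensor bytes]
fn split_container(bytes: &[u8]) -> Result<(&[u8], &[u8]), String> {
    if bytes.len() < 8 || &bytes[0..4] != b"DEMO" {
        return Err("bad magic or truncated file".into());
    }
    let len = u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]) as usize;
    if bytes.len() < 8 + len {
        return Err("metadata length exceeds file size".into());
    }
    let (meta, tensors) = bytes[8..].split_at(len);
    Ok((meta, tensors))
}

fn main() {
    // Build a tiny container in memory, standing in for what
    // include_bytes! would provide at compile time.
    let meta = br#"{"n_vocab":51865}"#;
    let mut file = Vec::new();
    file.extend_from_slice(b"DEMO");
    file.extend_from_slice(&(meta.len() as u32).to_le_bytes());
    file.extend_from_slice(meta);
    file.extend_from_slice(&1.0f32.to_le_bytes()); // one f32 "tensor"

    let (m, t) = split_container(&file).unwrap();
    assert_eq!(m, &meta[..]);
    assert_eq!(t.len(), 4);
    println!("parsed metadata: {}", String::from_utf8_lossy(m));
}
```

Because both halves are slices into the same embedded buffer, a WASM build pays no file I/O and no extra allocation to reach either part.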

Use Cases

Speech Recognition (Whisper-style)

```rust
writer.set_metadata("tokenizer", json!({
    "tokens": vocab_tokens,
    "merges": bpe_merges,
    "special_tokens": {
        "eot": 50256,
        "sot": 50257,
        "transcribe": 50358,
        "translate": 50359
    }
}));
```
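Once loaded, decoding code typically mirrors such metadata into a lookup table. A std-only sketch, with `special_tokens` as a hypothetical helper:

```rust
use std::collections::HashMap;

/// Builds an id -> name map mirroring the embedded special-token metadata.
fn special_tokens() -> HashMap<u32, &'static str> {
    [
        (50256, "eot"),
        (50257, "sot"),
        (50358, "transcribe"),
        (50359, "translate"),
    ]
    .into_iter()
    .collect()
}

fn main() {
    let special = special_tokens();
    // A decoder can now tell control tokens apart from ordinary text tokens.
    assert_eq!(special.get(&50257), Some(&"sot"));
    assert!(!special.contains_key(&42)); // ordinary vocabulary token
    println!("special-token lookup works");
}
```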

Language Models

```rust
writer.set_metadata("tokenizer", json!({
    "type": "BPE",
    "vocab_size": 32000,
    "pad_token": "<pad>",
    "eos_token": "</s>",
    "bos_token": "<s>"
}));
```

Custom Models

```rust
writer.set_metadata("preprocessing", json!({
    "mean": [0.485, 0.456, 0.406],
    "std": [0.229, 0.224, 0.225],
    "input_size": [224, 224]
}));
```
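At inference time this metadata drives the preprocessing step. A std-only sketch; `normalize_pixel` is a hypothetical helper, not part of the aprender API:

```rust
/// Per-channel normalization using the embedded preprocessing metadata.
fn normalize_pixel(rgb: [f32; 3], mean: [f32; 3], std: [f32; 3]) -> [f32; 3] {
    [
        (rgb[0] - mean[0]) / std[0],
        (rgb[1] - mean[1]) / std[1],
        (rgb[2] - mean[2]) / std[2],
    ]
}

fn main() {
    let mean = [0.485, 0.456, 0.406]; // from the "preprocessing" metadata
    let std = [0.229, 0.224, 0.225];
    let out = normalize_pixel([0.485, 0.456, 0.406], mean, std);
    // A pixel equal to the mean normalizes to zero in every channel.
    assert!(out.iter().all(|v| v.abs() < 1e-6));
    println!("{:?}", out);
}
```

Shipping mean/std inside the model file means the exact training-time constants travel with the weights, rather than being re-typed (and mistyped) in application code.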

Audio Models (Mel Filterbank)

For speech recognition models like Whisper, embedding the exact mel filterbank used during training is critical for correct transcription. Computing filterbanks at runtime produces different values due to normalization differences.

```rust
// Store filterbank as a named tensor (most efficient for 64KB+ data)
writer.add_tensor_f32(
    "audio.mel_filterbank",
    vec![80, 201],  // n_mels x n_freqs
    &filterbank_data,
);

// Store audio preprocessing config in metadata
writer.set_metadata("audio", json!({
    "sample_rate": 16000,
    "n_fft": 400,
    "hop_length": 160,
    "n_mels": 80
}));
```

Reading back:

```rust
let reader = AprReader::from_bytes(model_bytes)?;

// Read filterbank tensor
let filterbank = reader.read_tensor_f32("audio.mel_filterbank")?;

// Get audio config
let audio_config = reader.get_metadata("audio").unwrap();
let n_mels = audio_config["n_mels"].as_u64().unwrap() as usize;
let n_freqs = filterbank.len() / n_mels;

// Use filterbank for mel spectrogram computation
let mel_spectrogram = compute_mel(&audio_samples, &filterbank, n_mels, n_freqs);
```

Why this matters: Whisper was trained with librosa's slaney-normalized filterbank where row sums are ~0.025. Computing from scratch produces peak-normalized filterbanks with row sums of ~1.0+. This mismatch causes the "rererer" hallucination bug.
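The row-sum difference can be reproduced with a condensed filterbank sketch. For brevity this uses the HTK mel scale rather than librosa's slaney scale, so filter placement differs slightly, but the slaney-style *area* normalization, which is what drives the ~0.025 row sums, is the same:

```rust
fn hz_to_mel(f: f64) -> f64 { 2595.0 * (1.0 + f / 700.0).log10() }
fn mel_to_hz(m: f64) -> f64 { 700.0 * (10f64.powf(m / 2595.0) - 1.0) }

/// Triangular mel filterbank with slaney-style area normalization:
/// each filter is scaled by 2 / (f_hi - f_lo), giving it unit area in Hz.
fn mel_filterbank(n_mels: usize, n_fft: usize, sr: f64) -> Vec<Vec<f64>> {
    let n_freqs = n_fft / 2 + 1;
    let fft_freqs: Vec<f64> = (0..n_freqs)
        .map(|i| i as f64 * sr / n_fft as f64)
        .collect();
    let (mel_lo, mel_hi) = (hz_to_mel(0.0), hz_to_mel(sr / 2.0));
    let mel_pts: Vec<f64> = (0..n_mels + 2)
        .map(|i| mel_to_hz(mel_lo + (mel_hi - mel_lo) * i as f64 / (n_mels + 1) as f64))
        .collect();
    (0..n_mels)
        .map(|m| {
            let (lo, ctr, hi) = (mel_pts[m], mel_pts[m + 1], mel_pts[m + 2]);
            let enorm = 2.0 / (hi - lo); // slaney area normalization
            fft_freqs
                .iter()
                .map(|&f| {
                    let up = (f - lo) / (ctr - lo);
                    let down = (hi - f) / (hi - ctr);
                    enorm * up.min(down).max(0.0)
                })
                .collect()
        })
        .collect()
}

fn main() {
    let fb = mel_filterbank(80, 400, 16000.0); // Whisper-shaped: 80 x 201
    assert_eq!(fb[0].len(), 201);
    // Row sums of an area-normalized bank ≈ n_fft / sr = 400 / 16000 = 0.025,
    // far from the ~1.0 of a peak-normalized bank.
    let row_sum: f64 = fb[40].iter().sum();
    assert!((row_sum - 0.025).abs() < 0.005, "row sum {}", row_sum);
    println!("mid-filter row sum = {:.4}", row_sum);
}
```

Even so, placement and edge-handling details differ between implementations, which is exactly why embedding the training-time filterbank beats recomputing it.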

Benefits

| Benefit           | Description                              |
|-------------------|------------------------------------------|
| Single file       | No more managing multiple files          |
| Version-locked    | Metadata travels with weights            |
| WASM-ready        | Embed entire model in binary             |
| Integrity-checked | CRC32 checksum guards against corruption |
| Flexible          | Any JSON structure supported             |
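The integrity check can be illustrated with a standard CRC-32. Whether .apr uses exactly this variant (IEEE, reflected polynomial 0xEDB88320, as in zlib and gzip) is an assumption here; the sketch only shows the mechanism:

```rust
/// Bitwise CRC-32 (reflected polynomial 0xEDB88320), the variant
/// used by zlib and gzip. Table-driven versions are faster; this
/// form is the shortest correct one.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg(); // all-ones iff low bit set
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

fn main() {
    // Standard check value for this CRC variant.
    assert_eq!(crc32(b"123456789"), 0xCBF4_3926);
    // Any corruption changes the checksum.
    assert_ne!(crc32(b"123456788"), 0xCBF4_3926);
    println!("crc32 check value ok");
}
```

A reader can recompute the checksum over the payload and refuse to deserialize on mismatch, catching truncated downloads or bit rot before they surface as nonsense inference output.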

Binary Data: Metadata vs Tensor

When storing binary data (filterbanks, embeddings), choose the right approach:

| Data Size    | JSON Metadata             | Named Tensor |
|--------------|---------------------------|--------------|
| < 100 KB     | Preferred                 | Overkill     |
| 100 KB–1 MB  | Acceptable                | Recommended  |
| > 1 MB       | Avoid (slow JSON parsing) | Required     |

Mel filterbank (64KB): Both work; tensor is more efficient.

Vocabulary (1-5MB): Use JSON for string arrays, tensor for embedding matrices.

Large embeddings (>10MB): Always use tensors.
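The size guidance above can be sanity-checked in a few lines: encoding f32 values as JSON text costs several times the fixed 4 bytes per value of a raw tensor, before any parsing cost is even counted. The naive string encoding below is illustrative, not how aprender serializes metadata:

```rust
fn main() {
    // 16,384 floats: 64 KiB as raw little-endian f32 bytes.
    let values: Vec<f32> = (0..16_384).map(|i| (i as f32) * 0.001).collect();
    let raw_bytes = values.len() * 4;

    // Naive JSON array encoding of the same values.
    let body: Vec<String> = values.iter().map(|v| format!("{}", v)).collect();
    let json = format!("[{}]", body.join(","));

    println!("raw: {} bytes, json: {} bytes", raw_bytes, json.len());
    // The text form is substantially larger, and must be parsed, not just sliced.
    assert!(json.len() > raw_bytes);
}
```

This is why the 64 KB mel filterbank sits comfortably in a named tensor, while short string-heavy data such as token lists stays natural in JSON.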