Case Study: Sharded SafeTensors Serving (GH-213)

The Bug

When running apr serve on a sharded SafeTensors model (3B+ parameters), the server crashed with:

SafeTensors header too large

The root cause: start_realizar_server() reads the first 8 bytes of the file for format detection. For a .safetensors.index.json file, those 8 bytes are {"weight — JSON text, not a binary header. The format detector interprets them as a SafeTensors header length, which decodes to an absurdly large number and trips the DoS protection.
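A quick way to see the failure mode (a standalone sketch, not the project's detection code): SafeTensors stores the header length as a little-endian u64 in the first 8 bytes, so treating the ASCII text {"weight as that length yields an impossible value.

// Standalone sketch: decode the first 8 bytes of an index.json as if they
// were a SafeTensors header length (little-endian u64).
fn main() {
    // First 8 bytes of a file that begins with `{"weight_map": ...`
    let first_eight: [u8; 8] = *b"{\"weight";
    let header_len = u64::from_le_bytes(first_eight);
    // Roughly 8.4e18 "bytes" -- far past any sane header limit, so the
    // DoS guard rejects the file with "SafeTensors header too large".
    println!("interpreted header length: {header_len}");
}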

Meanwhile, apr run already handled sharded models correctly via run_sharded_safetensors_inference() in realizar. The serve path simply lacked the same detection.

The Fix

Two changes, following the existing realizar pattern:

1. Early Detection in handlers.rs

Before reading any bytes from the file, check if the path ends with .safetensors.index.json:

// GH-213: Detect sharded SafeTensors index.json BEFORE reading file bytes.
let path_str = model_path.to_string_lossy();
if path_str.ends_with(".safetensors.index.json") {
    return super::safetensors::start_sharded_safetensors_server(model_path, config);
}

// ... existing 8-byte format detection continues for non-sharded files
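A minimal test sketch of the extension check (the helper name and tests are illustrative, not the project's actual test suite):

#[cfg(test)]
mod tests {
    use std::path::Path;

    // Hypothetical helper mirroring the extension check above.
    fn is_sharded_index(path: &Path) -> bool {
        path.to_string_lossy().ends_with(".safetensors.index.json")
    }

    #[test]
    fn detects_index_json_without_reading_bytes() {
        assert!(is_sharded_index(Path::new("model.safetensors.index.json")));
        assert!(!is_sharded_index(Path::new("model.safetensors")));
        assert!(!is_sharded_index(Path::new("config.json")));
    }
}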

2. Sharded Server Function in safetensors.rs

The new start_sharded_safetensors_server() mirrors the single-file start_safetensors_server() but uses:

  • ShardedSafeTensorsModel::load_from_index() instead of std::fs::read()
  • SafetensorsConfig::load_from_sibling() for config.json
  • SafetensorsToAprConverter::convert_sharded() instead of convert()

The rest (tokenizer loading, axum router, handler functions) is shared with the single-file path.
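For context, the index file that load_from_index() consumes follows the common Hugging Face sharded-index layout: a JSON object whose weight_map maps tensor names to shard files. The sketch below shows that layout and one way to collect the shard list; the parsing code, function name, and crate dependencies (serde, serde_json) are illustrative assumptions, not realizar's implementation.

// Illustrative sketch, not realizar's code: parse a *.safetensors.index.json
// (Hugging Face sharded-index convention) and list the shard files it names.
// Assumes the serde and serde_json crates as dependencies.
use std::collections::{BTreeSet, HashMap};

#[derive(serde::Deserialize)]
struct ShardIndex {
    // Tensor name -> shard file name, e.g.
    // "model.layers.0.self_attn.q_proj.weight" -> "model-00001-of-00002.safetensors"
    weight_map: HashMap<String, String>,
}

fn shard_files(index_json: &str) -> serde_json::Result<BTreeSet<String>> {
    let index: ShardIndex = serde_json::from_str(index_json)?;
    Ok(index.weight_map.into_values().collect())
}

fn main() {
    let example = r#"{
        "weight_map": {
            "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
            "lm_head.weight": "model-00002-of-00002.safetensors"
        },
        "metadata": { "total_size": 6000000000 }
    }"#;
    // The distinct shards a sharded loader would open and merge.
    println!("{:?}", shard_files(example).unwrap());
}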

Verification

MVP playbook tests confirmed the fix across all model sizes:

Model   Shards            Serve CPU           Serve GPU
0.5B    1 (single file)   Pass                Pass
3B      2                 Pass (was crash)    Pass (was crash)
7B      4                 Pass (was crash)    Pass (was crash)
14B     6                 Timeout (resource)  Timeout (resource)

The 14B timeouts are a resource limitation, not a code bug: at F32, 14B parameters × 4 bytes ≈ 56GB of weights, which cannot be loaded within the 120s server-readiness timeout.

Lessons

  1. Format detection must handle metadata files. Binary magic-byte detection fails on JSON index files. Check file extensions first for known patterns before falling back to byte-level detection.

  2. Mirror existing patterns. The apr run sharded path in realizar was the reference implementation. The serve fix reuses the same APIs (ShardedSafeTensorsModel, SafetensorsToAprConverter::convert_sharded) rather than reinventing.

  3. Test at every model size. The bug only manifests with sharded models (3B+). Single-file models (0.5B) work fine. Without multi-model testing, this would have been missed.

References

  • Bug 205 in the showcase spec
  • realizar/src/infer/mod.rs:1379 — reference sharded inference implementation
  • realizar/src/safetensors/mod.rs — ShardedSafeTensorsModel API