Case Study: Sharded SafeTensors Serving (GH-213)

The Bug

When running apr serve on a sharded SafeTensors model (3B+ parameters), the server crashed with:

SafeTensors header too large

The root cause: start_realizar_server() reads the first 8 bytes of the file for format detection. For a .safetensors.index.json file, those 8 bytes are {"weight — JSON text, not a binary header. The format detector interprets them as a SafeTensors header length, which decodes to an absurdly large number and trips the DoS protection.
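A quick way to see the failure mode (a standalone sketch, not the project's detection code): SafeTensors stores the header length as a little-endian u64 in the first 8 bytes, so treating the ASCII text {"weight as that length yields an impossible value.

// Standalone sketch: decode the first 8 bytes of an index.json as if they
// were a SafeTensors header length (little-endian u64).
fn main() {
    // First 8 bytes of a file that begins with `{"weight_map": ...`
    let first_eight: [u8; 8] = *b"{\"weight";
    let header_len = u64::from_le_bytes(first_eight);
    // Roughly 8.4e18 "bytes" -- far past any sane header limit, so the
    // DoS guard rejects the file with "SafeTensors header too large".
    println!("interpreted header length: {header_len}");
}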

Meanwhile, apr run already handled sharded models correctly via run_sharded_safetensors_inference() in realizar. The serve path simply lacked the same detection.

The Fix

Two changes, following the existing realizar pattern:

1. Early Detection in handlers.rs

Before reading any bytes from the file, check if the path ends with .safetensors.index.json:

// GH-213: Detect sharded SafeTensors index.json BEFORE reading file bytes.
let path_str = model_path.to_string_lossy();
if path_str.ends_with(".safetensors.index.json") {
    return super::safetensors::start_sharded_safetensors_server(model_path, config);
}

// ... existing 8-byte format detection continues for non-sharded files
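A minimal test sketch of the extension check (the helper name and tests are illustrative, not the project's actual test suite):

#[cfg(test)]
mod tests {
    use std::path::Path;

    // Hypothetical helper mirroring the extension check above.
    fn is_sharded_index(path: &Path) -> bool {
        path.to_string_lossy().ends_with(".safetensors.index.json")
    }

    #[test]
    fn detects_index_json_without_reading_bytes() {
        assert!(is_sharded_index(Path::new("model.safetensors.index.json")));
        assert!(!is_sharded_index(Path::new("model.safetensors")));
        assert!(!is_sharded_index(Path::new("config.json")));
    }
}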

2. Sharded Server Function in safetensors.rs

The new start_sharded_safetensors_server() mirrors the single-file start_safetensors_server() but uses:

  • ShardedSafeTensorsModel::load_from_index() instead of std::fs::read()
  • SafetensorsConfig::load_from_sibling() for config.json
  • SafetensorsToAprConverter::convert_sharded() instead of convert()

The rest (tokenizer loading, axum router, handler functions) is shared with the single-file path.
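For context, the index file that load_from_index() consumes follows the common Hugging Face sharded-index layout: a JSON object whose weight_map maps tensor names to shard files. The sketch below shows that layout and one way to collect the shard list; the parsing code, function name, and crate dependencies (serde, serde_json) are illustrative assumptions, not realizar's implementation.

// Illustrative sketch, not realizar's code: parse a *.safetensors.index.json
// (Hugging Face sharded-index convention) and list the shard files it names.
// Assumes the serde and serde_json crates as dependencies.
use std::collections::{BTreeSet, HashMap};

#[derive(serde::Deserialize)]
struct ShardIndex {
    // Tensor name -> shard file name, e.g.
    // "model.layers.0.self_attn.q_proj.weight" -> "model-00001-of-00002.safetensors"
    weight_map: HashMap<String, String>,
}

fn shard_files(index_json: &str) -> serde_json::Result<BTreeSet<String>> {
    let index: ShardIndex = serde_json::from_str(index_json)?;
    Ok(index.weight_map.into_values().collect())
}

fn main() {
    let example = r#"{
        "weight_map": {
            "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
            "lm_head.weight": "model-00002-of-00002.safetensors"
        },
        "metadata": { "total_size": 6000000000 }
    }"#;
    // The distinct shards a sharded loader would open and merge.
    println!("{:?}", shard_files(example).unwrap());
}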

Verification

MVP playbook tests confirmed the fix across all model sizes:

Model   Shards            Serve CPU           Serve GPU
0.5B    1 (single file)   Pass                Pass
3B      2                 Pass (was crash)    Pass (was crash)
7B      4                 Pass (was crash)    Pass (was crash)
14B     6                 Timeout (resource)  Timeout (resource)

The 14B timeouts are a resource limitation, not a code bug: at F32, 14B parameters × 4 bytes ≈ 56GB of weights, which cannot be loaded within the 120s server-readiness timeout.

Lessons

  1. Format detection must handle metadata files. Binary magic-byte detection fails on JSON index files. Check file extensions first for known patterns before falling back to byte-level detection.

  2. Mirror existing patterns. The apr run sharded path in realizar was the reference implementation. The serve fix reuses the same APIs (ShardedSafeTensorsModel, SafetensorsToAprConverter::convert_sharded) rather than reinventing.

  3. Test at every model size. The bug only manifests with sharded models (3B+). Single-file models (0.5B) work fine. Without multi-model testing, this would have been missed.

References

  • Bug 205 in the showcase spec
  • realizar/src/infer/mod.rs:1379 — reference sharded inference implementation
  • realizar/src/safetensors/mod.rs — ShardedSafeTensorsModel API