# Case Study: Sharded SafeTensors Serving (GH-213)
## The Bug
When running `apr serve` on a sharded SafeTensors model (3B+ parameters), the server crashed with:

```
SafeTensors header too large
```
The root cause: `start_realizar_server()` reads the first 8 bytes of the file for format detection. For a `.safetensors.index.json` file, those 8 bytes are `{"weight` — JSON text, not a binary header. The format detector interprets them as a SafeTensors header size (a massive number), triggering the DoS protection.
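To see why the decoded size is so large (a minimal illustration, not the actual detector code): a SafeTensors file begins with a little-endian u64 header length, so the ASCII bytes `{"weight` decode to a number on the order of 10^18.

```rust
fn main() {
    // First 8 bytes of a .safetensors.index.json file are JSON text...
    let first_eight: [u8; 8] = *b"{\"weight";
    // ...but a SafeTensors reader decodes them as a little-endian u64
    // header length, since the real format starts with one.
    let bogus_header_len = u64::from_le_bytes(first_eight);
    // The decoded "length" exceeds 2^60 bytes, which trips any sane
    // maximum-header-size (DoS) check.
    assert!(bogus_header_len > (1u64 << 60));
    println!("decoded header length: {bogus_header_len}");
}
```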
Meanwhile, `apr run` already handled sharded models correctly via `run_sharded_safetensors_inference()` in realizar. The serve path simply lacked the same detection.
## The Fix
Two changes, following the existing realizar pattern:
### 1. Early Detection in `handlers.rs`
Before reading any bytes from the file, check whether the path ends with `.safetensors.index.json`:
```rust
// GH-213: Detect sharded SafeTensors index.json BEFORE reading file bytes.
let path_str = model_path.to_string_lossy();
if path_str.ends_with(".safetensors.index.json") {
    return super::safetensors::start_sharded_safetensors_server(model_path, config);
}
// ... existing 8-byte format detection continues for non-sharded files
```
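The detection itself is just a suffix match on the lossy path string. A standalone sketch of the same check (the helper name here is hypothetical, introduced only for illustration):

```rust
use std::path::Path;

/// Hypothetical helper: true when the path names a sharded-model index file.
fn is_sharded_index(path: &Path) -> bool {
    path.to_string_lossy().ends_with(".safetensors.index.json")
}

fn main() {
    assert!(is_sharded_index(Path::new("qwen-3b/model.safetensors.index.json")));
    assert!(!is_sharded_index(Path::new("qwen-0.5b/model.safetensors")));
    println!("ok");
}
```

Matching on the full `.safetensors.index.json` suffix (rather than `.json` alone) keeps unrelated JSON files on the existing byte-level detection path.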
### 2. Sharded Server Function in `safetensors.rs`
The new `start_sharded_safetensors_server()` mirrors the single-file `start_safetensors_server()` but uses:

- `ShardedSafeTensorsModel::load_from_index()` instead of `std::fs::read()`
- `SafetensorsConfig::load_from_sibling()` for `config.json`
- `SafetensorsToAprConverter::convert_sharded()` instead of `convert()`
The rest (tokenizer loading, axum router, handler functions) is shared with the single-file path.
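Put together, the new function plausibly has the following shape. This is a non-compiling sketch: apart from the three substituted calls named above, every signature, type, and helper here (`ServeConfig`, `serve_apr_model`, argument lists) is an assumption, not the actual realizar implementation.

```rust
// Sketch only — exact signatures in realizar may differ.
fn start_sharded_safetensors_server(model_path: &Path, config: ServeConfig) -> Result<()> {
    // Read the index and every shard it lists, instead of std::fs::read() on one file.
    let model = ShardedSafeTensorsModel::load_from_index(model_path)?;
    // config.json sits next to the index file.
    let st_config = SafetensorsConfig::load_from_sibling(model_path)?;
    // Sharded conversion instead of the single-file convert().
    let apr_model = SafetensorsToAprConverter::convert_sharded(&model, &st_config)?;
    // Tokenizer loading, the axum router, and the handler functions are
    // shared with the single-file start_safetensors_server() path from here.
    serve_apr_model(apr_model, config)
}
```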
## Verification
MVP playbook tests confirmed the fix across all model sizes:
| Model | Shards | Serve CPU | Serve GPU |
|---|---|---|---|
| 0.5B | 1 (single file) | Pass | Pass |
| 3B | 2 | Pass (was crash) | Pass (was crash) |
| 7B | 4 | Pass (was crash) | Pass (was crash) |
| 14B | 6 | Timeout (resource) | Timeout (resource) |
The 14B timeouts are a resource limitation (56GB F32 model exceeds the 120s server-readiness timeout), not a code bug.
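The 56GB figure follows directly from the parameter count and the F32 element width:

```rust
fn main() {
    // 14 billion parameters, 4 bytes per F32 parameter.
    let params: u64 = 14_000_000_000;
    let bytes = params * 4;
    println!("{} GB", bytes / 1_000_000_000); // prints "56 GB"
}
```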
## Lessons
- **Format detection must handle metadata files.** Binary magic-byte detection fails on JSON index files. Check file extensions first for known patterns before falling back to byte-level detection.
- **Mirror existing patterns.** The `apr run` sharded path in realizar was the reference implementation. The serve fix reuses the same APIs (`ShardedSafeTensorsModel`, `SafetensorsToAprConverter::convert_sharded`) rather than reinventing them.
- **Test at every model size.** The bug only manifests with sharded models (3B+); single-file models (0.5B) work fine. Without multi-model testing, this would have been missed.
## Related
- Bug 205 in the showcase spec
- `realizar/src/infer/mod.rs:1379` — reference sharded inference implementation
- `realizar/src/safetensors/mod.rs` — `ShardedSafeTensorsModel` API