Chapter 52: MLOps Model Quality (CB-1000 to CB-1008)

The CB-1000 series detects quality and metadata defects in ML model binary files. The checks inspect file headers and metadata and never decode tensor data, which keeps them practical for multi-gigabyte model files in CI pipelines. They are motivated by real production bugs: corrupt GGUF headers with impossible tensor counts (BUG-GGUF-001), sharded SafeTensors deployments missing their index file (BUG-212), and APR files shipped without CRC integrity checksums. The series is grounded in Sculley et al. (2015) on ML technical debt and in the empirical observation that model binary metadata is the most common source of silent deployment failures.

Overview

The CB-1000 series integrates with two PMAT entry points: compliance checking via pmat comply check and model inventory via pmat analyze models.

# Run all compliance checks including CB-1000 series
pmat comply check

# Example output:
# ⚠ CB-1000: MLOps Model Quality (CB-1000 to CB-1008): [Advisory] 0 errors, 2 warnings, 1 info:
# CB-1000: Model directory has 2 model file(s) but no model card (README.md): llama-7b.gguf, llama-7b-q4.gguf (models:0)
# CB-1004: GGUF file missing `general.architecture` metadata key (BUG-EXPORT-004) (models/export.gguf:0)
# CB-1007: Model file is 13.2 GB — consider quantization or sharding (models/llama-7b-f16.gguf:0)

# Model inventory table
pmat analyze models --path ./models

# JSON output for CI pipelines
pmat analyze models --path ./models --format json

# Inventory with inline compliance checks
pmat analyze models --path ./models --check

The CB-1000 series is advisory — it reports with Warn status but does not block CI or commits. Violations are categorized into three severity tiers:

Severity	Meaning	Example
Error	Likely corrupt or broken deployment	>100K tensors (corrupt header), sharded without index
Warning	Missing metadata, potential issue	No model card, no tokenizer, missing architecture key
Info	Suggestion, low priority	File >10 GB, consider quantization

Supported Model Formats

PMAT recognizes three model binary formats by file extension:

Format	Extension	Ecosystem	Header Structure
GGUF	`.gguf`	llama.cpp, LLaMA, Mistral, Phi	Magic `0x46554747` (LE), version u32, tensor count u64, metadata KV pairs
SafeTensors	`.safetensors`	HuggingFace Transformers, Diffusers	Header length u64, JSON header with tensor dtype/shape/offsets
APR	`.apr`	aprender (batuta stack)	Magic `APR2`, metadata length u32, JSON metadata, CRC32 footer

The header parser extracts minimal metadata without loading tensor data:

pub struct ModelMetadata {
    pub format: ModelFormat,       // Gguf, Apr, or SafeTensors
    pub file_size_bytes: u64,
    pub tensor_count: Option<u64>, // From header parse
    pub architecture: Option<String>,
    pub has_crc: bool,             // APR CRC32 footer
}

Defect Taxonomy

Metadata & Documentation (CB-1000, CB-1002)

ID	Check	Severity	What it detects
CB-1000	Missing Model Card	Warning	Model directory has `.gguf`/`.safetensors`/`.apr` files but no `README.md` or `model_card.md`
CB-1002	Missing Tokenizer	Warning	GGUF model directory without `tokenizer.json`, `tokenizer.model`, or `vocab.json`

Format Validation (CB-1001, CB-1004, CB-1008)

ID	Check	Severity	What it detects
CB-1001	Oversized Tensor Count	Error	GGUF/APR/SafeTensors header reports >100,000 tensors — likely corrupt (BUG-GGUF-001)
CB-1004	Missing Architecture	Warning	GGUF file lacks `general.architecture` metadata key (BUG-EXPORT-004)
CB-1008	APR Missing CRC	Warning	APR file without CRC32 footer checksum for integrity verification

Size & Structure (CB-1005, CB-1006, CB-1007)

ID	Check	Severity	What it detects
CB-1005	Quantization Mismatch	Warning	Filename claims quantization (e.g., `f32`) but file size is inconsistent
CB-1006	Sharded Without Index	Error	Multiple `model-NNNNN-of-NNNNN.safetensors` shard files without `model.safetensors.index.json` (BUG-212)
CB-1007	Excessive File Size	Info	Individual model file >10 GB — consider quantization or sharding

`pmat analyze models` Command

The pmat analyze models command provides a model file inventory with optional compliance checking:

pmat analyze models --path ./my-project

Example table output:

Model Inventory (4 files, 18.7 GB total)
────────────────────────────────────────────────────────────────────────
File                                     Format          Size
────────────────────────────────────────────────────────────────────────
models/llama-7b-q4_0.gguf               GGUF          3.8 GB
models/llama-7b-f16.gguf                 GGUF         13.2 GB
weights/model.safetensors                SafeTensors    1.5 GB
weights/embeddings.apr                   APR          210.4 MB
────────────────────────────────────────────────────────────────────────

JSON output for CI integration (the models array is abbreviated here to two of the four entries):

pmat analyze models --path ./my-project --format json

{
  "model_count": 4,
  "total_size_bytes": 20078886912,
  "total_size_human": "18.7 GB",
  "models": [
    {
      "file": "models/llama-7b-q4_0.gguf",
      "format": "GGUF",
      "size_bytes": 4080218112,
      "size_human": "3.8 GB"
    },
    {
      "file": "weights/model.safetensors",
      "format": "SafeTensors",
      "size_bytes": 1610612736,
      "size_human": "1.5 GB"
    }
  ]
}
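The sample figures above are consistent with binary multiples: 1610612736 bytes is shown as 1.5 GB, which holds only if 1 GB is taken as 1024^3 bytes. The following is a minimal sketch of such a formatter. The function name `human_size` and the rounding to one decimal place are assumptions inferred from the sample output, not taken from PMAT's source.

```rust
/// Format a byte count with binary multiples (1 GB = 1024^3 bytes) and one
/// decimal place. Hypothetical helper: the unit convention is inferred from
/// the sample output in this chapter.
pub fn human_size(bytes: u64) -> String {
    const UNITS: [&str; 5] = ["B", "KB", "MB", "GB", "TB"];
    let mut value = bytes as f64;
    let mut unit = 0;
    // Divide by 1024 until the value fits the current unit.
    while value >= 1024.0 && unit < UNITS.len() - 1 {
        value /= 1024.0;
        unit += 1;
    }
    if unit == 0 {
        format!("{} B", bytes)
    } else {
        format!("{:.1} {}", value, UNITS[unit])
    }
}
```

Applied to the `size_bytes` values in the JSON above, this reproduces the `size_human` strings shown there.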

With --check, compliance violations are appended after the inventory:

pmat analyze models --path ./my-project --check

Model Inventory (2 files, 5.3 GB total)
────────────────────────────────────────────────────────────────────────
File                                     Format          Size
────────────────────────────────────────────────────────────────────────
models/export.gguf                       GGUF          3.8 GB
models/weights.safetensors               SafeTensors    1.5 GB
────────────────────────────────────────────────────────────────────────

⚠️ CB-1000: Model directory has 2 model file(s) but no model card (README.md): export.gguf, weights.safetensors (models)
⚠️ CB-1004: GGUF file missing `general.architecture` metadata key (BUG-EXPORT-004) (models/export.gguf)

The command aliases model and mlops are also accepted:

pmat analyze model --path ./models          # Singular alias
pmat analyze mlops --path ./models --check  # MLOps alias

Detection Algorithms

CB-1000: Missing Model Card

Groups model files by parent directory and checks for the presence of documentation:

models/
  llama-7b.gguf          # Model file present
  llama-7b-q4.gguf       # Model file present
                          # ❌ No README.md, readme.md, model_card.md, or MODEL_CARD.md

The check accepts any of four filenames as a valid model card: README.md, readme.md, model_card.md, MODEL_CARD.md. A single model card covers all model files in the same directory.

models/
  llama-7b.gguf          # ✅ Covered by README.md
  llama-7b-q4.gguf       # ✅ Covered by README.md
  README.md              # Model card present
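The grouping step can be sketched as a pure function over a list of paths. This is an illustration under stated assumptions, not PMAT's implementation: the name `dirs_missing_model_card` and the constants are hypothetical, and the input is assumed to be the flat list of files found by the walker.

```rust
use std::collections::{BTreeMap, BTreeSet};
use std::path::{Path, PathBuf};

// Extensions and model card names as listed in this chapter.
const MODEL_EXTENSIONS: [&str; 3] = ["gguf", "safetensors", "apr"];
const MODEL_CARD_NAMES: [&str; 4] =
    ["README.md", "readme.md", "model_card.md", "MODEL_CARD.md"];

/// Return every directory that holds model files but no model card,
/// mapped to the model file names found in it. Hypothetical helper.
pub fn dirs_missing_model_card(files: &[PathBuf]) -> BTreeMap<PathBuf, Vec<String>> {
    let mut models: BTreeMap<PathBuf, Vec<String>> = BTreeMap::new();
    let mut documented: BTreeSet<PathBuf> = BTreeSet::new();

    for file in files {
        let dir = file.parent().unwrap_or_else(|| Path::new("")).to_path_buf();
        let name = match file.file_name().and_then(|n| n.to_str()) {
            Some(n) => n,
            None => continue, // skip names that are not valid UTF-8
        };
        if MODEL_CARD_NAMES.contains(&name) {
            documented.insert(dir);
        } else if file
            .extension()
            .and_then(|e| e.to_str())
            .map_or(false, |e| MODEL_EXTENSIONS.contains(&e))
        {
            models.entry(dir).or_default().push(name.to_string());
        }
    }

    // One model card covers every model file in the same directory.
    models.retain(|dir, _| !documented.contains(dir));
    models
}
```

Each remaining map entry corresponds to one CB-1000 warning, with the file names available for the message.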

CB-1001: Oversized Tensor Count

Parses the tensor count from each format’s header and flags values exceeding 100,000 as likely corrupt:

GGUF header layout:
  Offset 0:  u32 LE — magic (0x46554747)
  Offset 4:  u32 LE — version
  Offset 8:  u64 LE — tensor_count  ← checked against 100,000 limit
  Offset 16: u64 LE — metadata_kv_count

For SafeTensors, tensor count is derived by counting "dtype" fields in the JSON header (minus 1 if __metadata__ is present). For APR, tensor count is estimated by counting "name" fields in the JSON metadata block.
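As a rough illustration of the SafeTensors case, the sketch below reads the 8-byte little-endian header length, slices out the JSON header, and counts `"dtype"` keys without parsing the JSON. The function name `safetensors_tensor_count` is hypothetical, and the sketch leaves out the `__metadata__` adjustment mentioned above.

```rust
use std::convert::{TryFrom, TryInto};

/// Count tensors in a SafeTensors buffer by scanning its JSON header for
/// `"dtype"` keys. Returns None when the buffer is too short or the declared
/// header length runs past the end of the buffer. Hypothetical helper.
pub fn safetensors_tensor_count(bytes: &[u8]) -> Option<u64> {
    // First 8 bytes: header length as a little-endian u64.
    let len_bytes: [u8; 8] = bytes.get(0..8)?.try_into().ok()?;
    let header_len = usize::try_from(u64::from_le_bytes(len_bytes)).ok()?;
    let end = 8usize.checked_add(header_len)?;
    let header = bytes.get(8..end)?;

    // Byte-level scan; the JSON is never parsed and tensor data is never read.
    let needle: &[u8] = b"\"dtype\"";
    let count = header.windows(needle.len()).filter(|w| *w == needle).count();
    Some(count as u64)
}
```

Because `get` is used for every slice, a corrupt length field yields `None` rather than a panic.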

A tensor count of 200,000 in a GGUF file almost certainly indicates a corrupt or byte-swapped header, not an actual model with 200K tensors. This check catches BUG-GGUF-001 class defects.
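The GGUF layout above maps directly onto fixed-offset reads. The following sketch parses the first 24 bytes and applies the 100,000 limit. The type and function names (`GgufHeader`, `parse_gguf_header`, `is_oversized`) are hypothetical and chosen for illustration only.

```rust
use std::convert::TryInto;

const GGUF_MAGIC: u32 = 0x4655_4747; // bytes "GGUF" read as a little-endian u32
const MAX_TENSOR_COUNT: u64 = 100_000;

#[derive(Debug, PartialEq)]
pub struct GgufHeader {
    pub version: u32,
    pub tensor_count: u64,
    pub metadata_kv_count: u64,
}

/// Parse the fixed 24-byte GGUF prefix. Returns None for short buffers or a
/// wrong magic value. Hypothetical helper.
pub fn parse_gguf_header(bytes: &[u8]) -> Option<GgufHeader> {
    let magic = u32::from_le_bytes(bytes.get(0..4)?.try_into().ok()?);
    if magic != GGUF_MAGIC {
        return None;
    }
    Some(GgufHeader {
        version: u32::from_le_bytes(bytes.get(4..8)?.try_into().ok()?),
        tensor_count: u64::from_le_bytes(bytes.get(8..16)?.try_into().ok()?),
        metadata_kv_count: u64::from_le_bytes(bytes.get(16..24)?.try_into().ok()?),
    })
}

/// True when the header reports more tensors than the CB-1001 limit.
pub fn is_oversized(header: &GgufHeader) -> bool {
    header.tensor_count > MAX_TENSOR_COUNT
}
```

Only 24 bytes are needed, so a caller can read just that prefix from disk regardless of the file's total size.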

CB-1002: Missing Tokenizer

For directories containing GGUF files (which are typically language models), checks for the presence of a companion tokenizer file:

# ❌ CB-1002 Warning — GGUF without tokenizer:
models/
  llama-7b.gguf

# ✅ Passes — tokenizer present:
models/
  llama-7b.gguf
  tokenizer.json          # or tokenizer.model, or vocab.json

This check only applies to directories containing .gguf files, since GGUF is the primary format for language models that require tokenizer files. SafeTensors and APR directories are not checked.
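The rule reduces to two membership tests over the names in one directory. A minimal sketch, assuming the caller has already listed the directory; the function name `gguf_dir_missing_tokenizer` is hypothetical:

```rust
// Tokenizer file names accepted by CB-1002, as listed in this chapter.
const TOKENIZER_FILES: [&str; 3] = ["tokenizer.json", "tokenizer.model", "vocab.json"];

/// `file_names` lists the entries of one directory. Returns true when the
/// directory contains at least one .gguf file and none of the tokenizer
/// files. Hypothetical helper.
pub fn gguf_dir_missing_tokenizer(file_names: &[&str]) -> bool {
    let has_gguf = file_names.iter().any(|n| n.ends_with(".gguf"));
    let has_tokenizer = file_names.iter().any(|n| TOKENIZER_FILES.contains(n));
    has_gguf && !has_tokenizer
}
```

Directories without any `.gguf` file return false, which matches the scope described above.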

CB-1004: Missing Architecture (GGUF)

Performs a byte-level scan of the entire GGUF file for the string general.architecture. GGUF metadata keys are stored as length-prefixed strings in the binary, so a simple windows() scan reliably detects the key:

// `content` is assumed to be a byte slice (&[u8]) holding the file contents.
let needle = b"general.architecture";
let has_arch = content
    .windows(needle.len())
    .any(|w| w == needle);

GGUF files exported by conformant tools (llama.cpp, aprender) always include general.architecture with a value like "llama", "mistral", "phi", or "gpt2". Its absence indicates a broken export pipeline (BUG-EXPORT-004). Files smaller than 100 bytes are skipped as obviously incomplete.
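Combining the scan with the 100-byte cutoff gives a three-way result: skipped, present, or missing. The sketch below is one way to express that; the function name `missing_architecture` and the use of `Option<bool>` for the skipped case are assumptions made for illustration.

```rust
const NEEDLE: &[u8] = b"general.architecture";
const MIN_GGUF_SIZE: usize = 100; // smaller files are skipped as incomplete

/// Returns None when the file is too small to be checked, Some(true) when the
/// key is absent, and Some(false) when it is present. Hypothetical helper.
pub fn missing_architecture(content: &[u8]) -> Option<bool> {
    if content.len() < MIN_GGUF_SIZE {
        return None;
    }
    let found = content.windows(NEEDLE.len()).any(|w| w == NEEDLE);
    Some(!found)
}
```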

CB-1005: Quantization Mismatch

Cross-references the quantization type claimed in the filename against file size heuristics. The detector recognizes 24 quantization identifiers:

q2_k, q3_k, q3_k_s, q3_k_m, q3_k_l, q4_0, q4_1, q4_k, q4_k_s, q4_k_m,
q5_0, q5_1, q5_k, q5_k_s, q5_k_m, q6_k, q6_k_l, q8_0, q8_1,
f16, f32, bf16, iq4_xs, iq4_nl

Currently, the mismatch detection focuses on the most extreme case: a filename claiming f32 quantization (full precision, 4 bytes per parameter) but having a suspiciously small file size (<100 KB). An F32 model with any meaningful number of parameters produces a file measured in megabytes at minimum.

# ❌ CB-1005 Warning:
model-f32.gguf   (size: 48 KB)  # F32 claim but tiny file — likely mislabeled

# ✅ Passes:
model-f32.gguf   (size: 14.2 GB)  # F32 with reasonable size
model-q4_0.gguf  (size: 3.8 GB)   # Q4_0 — no size assertion for quantized
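A possible shape for the filename heuristic is shown below. It is a sketch under assumptions: the identifiers are matched as case-insensitive substrings, longest first, and 100 KB is taken as 100 × 1024 bytes. The names `claimed_quantization` and `is_quantization_mismatch` are hypothetical.

```rust
// The 24 identifiers from this chapter, longest first so that "q4_k_m" wins
// over "q4_k" and "bf16" wins over "f16".
const QUANT_TAGS: [&str; 24] = [
    "q3_k_s", "q3_k_m", "q3_k_l", "q4_k_s", "q4_k_m", "q5_k_s", "q5_k_m", "q6_k_l",
    "iq4_xs", "iq4_nl", "q2_k", "q3_k", "q4_0", "q4_1", "q4_k", "q5_0", "q5_1",
    "q5_k", "q6_k", "q8_0", "q8_1", "bf16", "f16", "f32",
];

const F32_MIN_PLAUSIBLE_BYTES: u64 = 100 * 1024; // assumed meaning of "100 KB"

/// Return the first quantization identifier found in the file name, if any.
pub fn claimed_quantization(file_name: &str) -> Option<&'static str> {
    let lower = file_name.to_ascii_lowercase();
    QUANT_TAGS.iter().copied().find(|tag| lower.contains(*tag))
}

/// Only the f32 case carries a size assertion, as described above.
pub fn is_quantization_mismatch(file_name: &str, size_bytes: u64) -> bool {
    claimed_quantization(file_name) == Some("f32") && size_bytes < F32_MIN_PLAUSIBLE_BYTES
}
```

Substring matching is deliberately loose; a stricter variant could require a `-`, `_`, or `.` delimiter on both sides of the identifier.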

CB-1006: Sharded SafeTensors Without Index

Detects the HuggingFace sharded SafeTensors pattern (model-NNNNN-of-NNNNN.safetensors) and verifies the companion index file exists:

# ❌ CB-1006 Error — sharded without index:
model/
  model-00001-of-00003.safetensors
  model-00002-of-00003.safetensors
  model-00003-of-00003.safetensors
  # Missing: model.safetensors.index.json

# ✅ Passes — index present:
model/
  model-00001-of-00003.safetensors
  model-00002-of-00003.safetensors
  model-00003-of-00003.safetensors
  model.safetensors.index.json

This is severity Error because a missing index makes it impossible for loading libraries to correctly map tensor names to shard files. This was the root cause of BUG-212 in production.
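The shard naming pattern can be recognized without a regex crate. The sketch below parses the index and total from a file name and then applies the rule to one directory listing. Both function names are hypothetical, and treating "multiple" as more than one shard is an assumption.

```rust
/// Parse "<prefix>-NNNNN-of-NNNNN.safetensors" into (index, total).
/// Returns None for any other file name. Hypothetical helper.
pub fn parse_shard_name(file_name: &str) -> Option<(u32, u32)> {
    let stem = file_name.strip_suffix(".safetensors")?;
    let (rest, total) = stem.rsplit_once("-of-")?;
    let (_prefix, index) = rest.rsplit_once('-')?;
    let all_digits = |s: &str| !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit());
    if !all_digits(index) || !all_digits(total) {
        return None;
    }
    Some((index.parse().ok()?, total.parse().ok()?))
}

/// `file_names` lists the entries of one directory. Returns true when more
/// than one shard is present and the index file is not.
pub fn sharded_without_index(file_names: &[&str]) -> bool {
    let shards = file_names
        .iter()
        .filter(|n| parse_shard_name(n).is_some())
        .count();
    let has_index = file_names.contains(&"model.safetensors.index.json");
    shards > 1 && !has_index
}
```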

CB-1007: Excessive File Size

Flags individual model files exceeding 10 GB (the LARGE_MODEL_THRESHOLD constant) with an advisory to consider quantization or sharding:

# ℹ️ CB-1007 Info:
# Model file is 13.2 GB — consider quantization or sharding
models/llama-7b-f16.gguf  (13.2 GB)

# No flag (under threshold):
models/llama-7b-q4_0.gguf (3.8 GB)

This check applies to all three formats (GGUF, SafeTensors, APR) and is purely informational. Large models are not inherently broken, but they increase deployment costs and transfer times.

CB-1008: APR Missing CRC

Parses the APR header and checks for a CRC32 footer checksum. APR files store a 4-byte CRC at the end of the file for integrity verification during loading:

APR file layout:
  Offset 0:      "APR2" magic (4 bytes)
  Offset 4:      metadata_len (u32 LE)
  Offset 8:      JSON metadata (metadata_len bytes)
  ...
  Offset EOF-4:  CRC32 footer (4 bytes)  ← checked by CB-1008

The detector reads the last 4 bytes of the file to determine if a CRC footer is present. Files without a CRC footer cannot be integrity-verified on load.
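For context, this is roughly what integrity verification on load could look like. The sketch makes several assumptions that the text above does not confirm: the checksum is the common IEEE CRC-32, it is stored little-endian, and it covers every byte before the footer. The function names are hypothetical.

```rust
use std::convert::TryInto;

/// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
pub fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // mask is all ones when the low bit is set, zero otherwise
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Assumption: the footer is the little-endian CRC-32 of all preceding bytes.
/// Returns None when the buffer is not a plausible APR file (wrong magic, or
/// shorter than magic + metadata_len + footer).
pub fn apr_footer_matches(file: &[u8]) -> Option<bool> {
    if file.len() < 12 || &file[0..4] != b"APR2" {
        return None;
    }
    let (body, footer) = file.split_at(file.len() - 4);
    let stored = u32::from_le_bytes(footer.try_into().ok()?);
    Some(stored == crc32(body))
}
```

A table-driven CRC would be faster on large files; the bitwise form is used here only to keep the example dependency-free.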

File Walking

All CB-1000 checks share a common directory walker (walkdir_model_files) that recursively scans for files with extensions .gguf, .apr, and .safetensors. The walker skips common non-project directories to avoid scanning dependency caches:

Skipped Directory	Reason
`.git`	Version control internals
`node_modules`	JavaScript dependencies
`target`	Rust build artifacts
`.pmat`	PMAT cache directory
`vendor`	Vendored dependencies
`build`, `dist`	Build output directories
`__pycache__`, `.venv`	Python artifacts
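A recursive walker with this skip list can be written with the standard library alone. The sketch below is not the `walkdir_model_files` implementation; the name `collect_model_files` and its signature are hypothetical, and error handling is simplified to propagating the first I/O error.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

const WALKED_EXTENSIONS: [&str; 3] = ["gguf", "apr", "safetensors"];
const SKIPPED_DIRS: [&str; 9] = [
    ".git", "node_modules", "target", ".pmat", "vendor",
    "build", "dist", "__pycache__", ".venv",
];

/// Recursively collect model files under `root`, skipping the directories
/// listed in the table above. Hypothetical helper.
pub fn collect_model_files(root: &Path, out: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(root)? {
        let entry = entry?;
        let path = entry.path();
        // file_type() does not follow symlinks, which avoids directory cycles.
        if entry.file_type()?.is_dir() {
            let name = entry.file_name();
            if SKIPPED_DIRS.iter().any(|d| name.to_str() == Some(*d)) {
                continue;
            }
            collect_model_files(&path, out)?;
        } else if path
            .extension()
            .and_then(|e| e.to_str())
            .map_or(false, |e| WALKED_EXTENSIONS.contains(&e))
        {
            out.push(path);
        }
    }
    Ok(())
}
```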

Testing

Tests use tempfile::TempDir to create synthetic model files with controlled headers:

use std::fs;
// The detect_cb100x_* functions come from the module under test.

#[test]
fn test_cb1000_detects_missing_model_card() {
    let temp = tempfile::tempdir().unwrap();
    let models_dir = temp.path().join("models");
    fs::create_dir_all(&models_dir).unwrap();

    // Write a minimal valid GGUF header
    let mut gguf_header = vec![0x47u8, 0x47, 0x55, 0x46]; // GGUF magic
    gguf_header.extend_from_slice(&3u32.to_le_bytes());     // version 3
    gguf_header.extend_from_slice(&10u64.to_le_bytes());    // tensor_count
    gguf_header.extend_from_slice(&5u64.to_le_bytes());     // metadata_count
    gguf_header.resize(64, 0);
    fs::write(models_dir.join("model.gguf"), &gguf_header).unwrap();

    let violations = detect_cb1000_missing_model_card(temp.path());
    assert_eq!(violations.len(), 1);
    assert_eq!(violations[0].pattern_id, "CB-1000");
}

#[test]
fn test_cb1001_detects_oversized_tensor_count() {
    let temp = tempfile::tempdir().unwrap();
    let mut header = vec![0x47u8, 0x47, 0x55, 0x46, 3, 0, 0, 0]; // GGUF magic + version
    header.extend_from_slice(&200_000u64.to_le_bytes()); // oversized tensor_count
    header.resize(64, 0);
    fs::write(temp.path().join("bad.gguf"), &header).unwrap();

    let violations = detect_cb1001_oversized_tensor_count(temp.path());
    assert_eq!(violations.len(), 1);
    assert_eq!(violations[0].pattern_id, "CB-1001");
}

#[test]
fn test_cb1006_detects_sharded_without_index() {
    let temp = tempfile::tempdir().unwrap();
    // Create sharded SafeTensors files (minimal valid headers):
    // an 8-byte little-endian header length followed by 8 bytes of JSON
    let header = 8u64.to_le_bytes();
    let json = b"{\"a\":{}}";
    let mut content = Vec::new();
    content.extend_from_slice(&header);
    content.extend_from_slice(json);

    fs::write(temp.path().join("model-00001-of-00002.safetensors"), &content).unwrap();
    fs::write(temp.path().join("model-00002-of-00002.safetensors"), &content).unwrap();

    let violations = detect_cb1006_sharded_without_index(temp.path());
    assert_eq!(violations.len(), 1);
    assert_eq!(violations[0].pattern_id, "CB-1006");
}

Test coverage includes both positive detection (violation present) and negative cases (no violation when metadata is correct).

CI/CD Integration

# .github/workflows/model-quality.yml
name: MLOps Model Quality
on: [push, pull_request]

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          lfs: true  # Model files are typically in Git LFS
      - name: Install PMAT
        run: cargo install pmat
      - name: Model Inventory
        run: pmat analyze models --format json
      - name: Model Quality Checks
        run: |
          OUTPUT=$(pmat comply check 2>&1)
          echo "$OUTPUT"
          # Fail on Error-severity violations (CB-1001, CB-1006).
          # The summary line reads "N errors, N warnings, N info".
          if echo "$OUTPUT" | grep -Eq "CB-1000.*[1-9][0-9]* errors"; then
            echo "::error::CB-1000 series has Error-severity violations"
            exit 1
          fi

Remediation Priority

When pmat comply check reports CB-1000 violations, fix them in this priority order:

1. CB-1001 Errors (>100K tensors): corrupt header, model will fail to load
2. CB-1006 Errors (sharded without index): model loading will fail at runtime
3. CB-1004: missing architecture prevents runtime format negotiation
4. CB-1005: quantization mismatch indicates mislabeled export
5. CB-1000: missing model card hinders reproducibility and auditability
6. CB-1002: missing tokenizer prevents text generation pipelines
7. CB-1008: missing CRC prevents integrity verification on load
8. CB-1007: informational, consider quantization for deployment efficiency

Academic Foundations

Paper	Finding	Applied To
Sculley et al. (2015). “Hidden Technical Debt in Machine Learning Systems”	ML systems accumulate configuration, data, and serving debt beyond code debt	CB-1000, CB-1002, CB-1006
Amershi et al. (2019). “Software Engineering for Machine Learning”	Model metadata and versioning are top engineering challenges	CB-1000, CB-1004
GGUF Specification (llama.cpp, 2023)	GGUF v3 header format with typed metadata KV pairs	CB-1001, CB-1004, CB-1005
HuggingFace SafeTensors Format (2023)	JSON header with tensor dtype, shape, data offsets	CB-1001, CB-1006

Specification Reference

Full detection logic: src/cli/handlers/comply_cb_detect/model_quality.rs
Model inventory handler: src/cli/handlers/analysis_handlers.rs (route_model_analysis)
Aggregate check: src/cli/handlers/comply_handlers/check_handlers.rs (check_model_quality)

PMAT: The PAIML MCP Agent Toolkit