apr - APR Model Operations CLI
The apr command-line tool provides inspection, debugging, validation, comparison, conversion, and inference (run, serve, chat) capabilities for .apr model files, with support for GGUF and SafeTensors where noted. It follows Toyota Way principles for quality and visibility.
Installation
cargo install --path crates/apr-cli
Or build from the workspace:
cargo build --release -p apr-cli
The binary will be available at target/release/apr.
Commands Overview
| Command | Description | Toyota Way Principle |
|---|---|---|
| `run` | Run model directly (auto-download, cache, execute) | Just-in-Time Production |
| `serve` | Start inference server with GPU acceleration | Just-in-Time Production |
| `chat` | Interactive chat with language models | Genchi Genbutsu (Go and See) |
| `inspect` | View model metadata and structure | Genchi Genbutsu (Go and See) |
| `debug` | Debug output with optional drama mode | Visualization |
| `validate` | Validate integrity with quality scoring | Jidoka (Built-in Quality) |
| `diff` | Compare two models | Kaizen (Continuous Improvement) |
| `tensors` | List tensor names, shapes, and statistics | Genchi Genbutsu (Go to the Source) |
| `trace` | Layer-by-layer analysis with anomaly detection | Visualization |
| `lint` | Check for best practices and conventions | Jidoka (Built-in Quality) |
| `probar` | Export for visual regression testing | Standardization |
| `import` | Import from HuggingFace, local files, or URLs | Automation |
| `export` | Export to SafeTensors, GGUF formats | Automation |
| `pull` | Download and cache model (Ollama-style UX) | Automation |
| `list` | List cached models | Visibility |
| `rm` | Remove model from cache | Standardization |
| `convert` | Quantization (int8, int4, fp16) and optimization | Kaizen |
| `merge` | Merge models (average, weighted strategies) | Kaizen |
| `tree` | Model architecture tree view | Visualization |
| `hex` | Hex dump tensor data | Genchi Genbutsu |
| `flow` | Data flow visualization | Visualization |
| `bench` | Benchmark throughput (spec H12: >= 10 tok/s) | Measurement |
| `eval` | Evaluate model perplexity (spec H13: PPL <= 20) | Measurement |
| `profile` | Deep profiling with Roofline analysis | Genchi Genbutsu |
| `qa` | Falsifiable QA checklist for model releases | Jidoka |
| `showcase` | Qwen2.5-Coder showcase demo | Standardization |
| `check` | Model self-test: 10-stage pipeline integrity | Jidoka |
| `publish` | Publish model to HuggingFace Hub | Automation |
| `cbtop` | ComputeBrick pipeline monitor | Visualization |
| `compare-hf` | Compare APR model against HuggingFace source | Jidoka |
| `explain` | Explain errors, architecture, and tensors | Knowledge Sharing |
| `tui` | Interactive terminal UI | Visualization |
| `canary` | Regression testing via tensor statistics | Jidoka |
Serve Command
Start an OpenAI-compatible inference server with optional GPU acceleration.
# Basic server (CPU)
apr serve model.gguf --port 8080
# GPU-accelerated server
apr serve model.gguf --port 8080 --gpu
# Batched GPU mode (~2.9x Ollama throughput)
apr serve model.gguf --port 8080 --gpu --batch
Performance
| Mode | Throughput | vs Ollama | Memory |
|---|---|---|---|
| CPU (baseline) | ~15 tok/s | 0.05x | 1.1 GB |
| GPU (single) | ~83 tok/s | 0.25x | 1.5 GB |
| GPU (batched) | ~850 tok/s | 2.9x | 1.9 GB |
| Ollama | ~333 tok/s | 1.0x | - |
Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/metrics` | GET | Prometheus metrics |
| `/v1/chat/completions` | POST | OpenAI-compatible chat |
| `/v1/completions` | POST | OpenAI-compatible completions |
| `/generate` | POST | Native generation endpoint |
Example Request
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 50
}'
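Because the endpoints are OpenAI-compatible, standard OpenAI client libraries can also be pointed at the local server. The sketch below uses the official Python openai package; the base URL matches the serve example above, and the dummy API key is an assumption for a local server without authentication.

```python
# Minimal sketch: call the apr server via the OpenAI Python SDK.
# Assumes `apr serve model.gguf --port 8080` is running locally and
# requires no authentication (the api_key value is a placeholder).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=50,
)
print(response.choices[0].message.content)
```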
Tracing Headers
Use the X-Trace-Level header for performance debugging:
# Token-level timing
curl -H "X-Trace-Level: brick" http://localhost:8080/v1/chat/completions ...
# Layer-level timing
curl -H "X-Trace-Level: layer" http://localhost:8080/v1/chat/completions ...
Tool Calling (GH-160)
The server supports OpenAI-compatible tool calling, allowing models to invoke external functions.
Define tools in your request:
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"},
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
},
"required": ["location"]
}
}
}
],
"max_tokens": 100
}'
Response with tool call:
{
"id": "chatcmpl-abc123",
"choices": [{
"message": {
"role": "assistant",
"content": null,
"tool_calls": [{
"id": "call_xyz789",
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"location\": \"Tokyo\", \"unit\": \"celsius\"}"
}
}]
},
"finish_reason": "tool_calls"
}]
}
Multi-turn with tool result:
After executing the tool, send the result back:
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [
{"role": "user", "content": "What is the weather in Tokyo?"},
{"role": "assistant", "content": null, "tool_calls": [{"id": "call_xyz789", "type": "function", "function": {"name": "get_weather", "arguments": "{\"location\": \"Tokyo\"}"}}]},
{"role": "tool", "tool_call_id": "call_xyz789", "content": "{\"temperature\": 22, \"condition\": \"sunny\"}"}
],
"max_tokens": 100
}'
The model will then generate a response incorporating the tool result.
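The loop below sketches this round trip in Python, using only the request and response shapes shown above. The local get_weather stub and the requests dependency are illustrative assumptions, not part of apr.

```python
# Sketch of the tool-calling round trip against the documented
# /v1/chat/completions endpoint. get_weather() is a placeholder for
# whatever function the tool call should dispatch to.
import json
import requests

URL = "http://localhost:8080/v1/chat/completions"
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

def get_weather(location, unit="celsius"):
    # Stand-in for a real weather lookup.
    return {"temperature": 22, "condition": "sunny"}

messages = [{"role": "user", "content": "What is the weather in Tokyo?"}]
reply = requests.post(URL, json={"model": "default", "messages": messages,
                                 "tools": TOOLS, "max_tokens": 100}).json()
msg = reply["choices"][0]["message"]

if msg.get("tool_calls"):
    messages.append(msg)  # keep the assistant turn that requested the tool
    for call in msg["tool_calls"]:
        args = json.loads(call["function"]["arguments"])
        result = get_weather(**args)  # execute the requested tool locally
        messages.append({"role": "tool", "tool_call_id": call["id"],
                         "content": json.dumps(result)})
    # Second request lets the model incorporate the tool result.
    reply = requests.post(URL, json={"model": "default", "messages": messages,
                                     "max_tokens": 100}).json()

print(reply["choices"][0]["message"]["content"])
```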
Tool choice control:
{
"tool_choice": "auto"
}
Options: "auto" (default), "none" (disable tools), or {"type": "function", "function": {"name": "specific_tool"}}.
Example code: run `cargo run --example tool_calling_demo` for a complete Rust example.
Chat Command
Interactive chat with language models (supports GGUF, APR, SafeTensors).
# Interactive chat (GPU by default)
apr chat model.gguf
# Force CPU inference
apr chat model.gguf --no-gpu
# Adjust generation parameters
apr chat model.gguf --temperature 0.7 --top-p 0.9 --max-tokens 512
Inspect Command
View model metadata, structure, and flags without loading the full payload.
# Basic inspection
apr inspect model.apr
# JSON output for automation
apr inspect model.apr --json
# Show vocabulary details
apr inspect model.apr --vocab
# Show filter/security details
apr inspect model.apr --filters
# Show weight statistics
apr inspect model.apr --weights
Example Output
=== model.apr ===
Type: LinearRegression
Version: 1.0
Size: 2.5 KiB
Compressed: 1.2 KiB (ratio: 2.08x)
Flags: COMPRESSED | SIGNED
Created: 2025-01-15T10:30:00Z
Framework: aprender 0.18.2
Name: Boston Housing Predictor
Description: Linear regression model for house price prediction
Debug Command
Simple debugging with optional theatrical "drama" mode.
# Basic debug output
apr debug model.apr
# Drama mode - theatrical output (inspired by whisper.apr)
apr debug model.apr --drama
# Hex dump of file bytes
apr debug model.apr --hex
# Extract ASCII strings
apr debug model.apr --strings
# Limit output lines
apr debug model.apr --hex --limit 512
Drama Mode Output
====[ DRAMA: model.apr ]====
ACT I: THE HEADER
Scene 1: Magic bytes... APRN (applause!)
Scene 2: Version check... 1.0 (standing ovation!)
Scene 3: Model type... LinearRegression (the protagonist!)
ACT II: THE METADATA
Scene 1: File size... 2.5 KiB
Scene 2: Flags... COMPRESSED | SIGNED
ACT III: THE VERDICT
CURTAIN CALL: Model is READY!
====[ END DRAMA ]====
Validate Command
Validate model integrity with optional 100-point quality assessment.
# Basic validation
apr validate model.apr
# With 100-point quality scoring
apr validate model.apr --quality
# Strict mode (fail on warnings)
apr validate model.apr --strict
Quality Assessment Output
Validating model.apr...
[PASS] Header complete (32 bytes)
[PASS] Magic bytes: APRN
[PASS] Version: 1.0 (supported)
[PASS] Digital signature present
[PASS] Metadata readable
Result: VALID (with 0 warnings)
=== 100-Point Quality Assessment ===
Structure: 25/25
- Header valid: 5/5
- Metadata complete: 5/5
- Checksum valid: 5/5
- Magic valid: 5/5
- Version supported: 5/5
Security: 25/25
- No pickle code: 5/5
- No eval/exec: 5/5
- Signed: 5/5
- Safe format: 5/5
- Safe tensors: 5/5
Weights: 25/25
- No NaN values: 5/5
- No Inf values: 5/5
- Reasonable range: 5/5
- Low sparsity: 5/5
- Healthy distribution: 5/5
Metadata: 25/25
- Training info: 5/5
- Hyperparameters: 5/5
- Metrics recorded: 5/5
- Provenance: 5/5
- Description: 5/5
TOTAL: 100/100 (EXCELLENT)
Diff Command
Compare two models to identify differences.
# Compare models
apr diff model1.apr model2.apr
# JSON output
apr diff model1.apr model2.apr --json
# Show weight-level differences
apr diff model1.apr model2.apr --weights
Example Output
Comparing model1.apr vs model2.apr
DIFF: 3 differences found:
version: 1.0 → 1.1
model_name: old-model → new-model
payload_size: 1024 → 2048
Tensors Command
List tensor names, shapes, and statistics from APR model files. Useful for debugging model structure and identifying issues.
# List all tensors
apr tensors model.apr
# Show statistics (mean, std, min, max)
apr tensors model.apr --stats
# Filter by name pattern
apr tensors model.apr --filter encoder
# Limit output
apr tensors model.apr --limit 10
# JSON output
apr tensors model.apr --json
Example Output
=== Tensors: model.apr ===
Total tensors: 4
Total size: 79.7 MiB
encoder.conv1.weight [f32] [384, 80, 3]
Size: 360.0 KiB
encoder.conv1.bias [f32] [384]
Size: 1.5 KiB
decoder.embed_tokens.weight [f32] [51865, 384]
Size: 76.0 MiB
audio.mel_filterbank [f32] [80, 201]
Size: 62.8 KiB
With Statistics
apr tensors model.apr --stats
=== Tensors: model.apr ===
encoder.conv1.weight [f32] [384, 80, 3]
Size: 360.0 KiB
Stats: mean=0.0012, std=0.0534
Range: [-0.1823, 0.1756]
Trace Command
Layer-by-layer analysis with anomaly detection. Useful for debugging model behavior and identifying numerical issues.
# Basic layer trace
apr trace model.apr
# Verbose with per-layer statistics
apr trace model.apr --verbose
# Filter by layer name pattern
apr trace model.apr --layer encoder
# Compare with reference model
apr trace model.apr --reference baseline.apr
# JSON output for automation
apr trace model.apr --json
# Payload tracing through model
apr trace model.apr --payload
# Diff mode with reference
apr trace model.apr --diff --reference old.apr
Example Output
=== Layer Trace: model.apr ===
Format: APR v1.0
Layers: 6
Parameters: 39680000
Layer Breakdown:
embedding
transformer_block_0 [0]
transformer_block_1 [1]
transformer_block_2 [2]
transformer_block_3 [3]
final_layer_norm
Verbose Output
apr trace model.apr --verbose
=== Layer Trace: model.apr ===
Layer Breakdown:
embedding
transformer_block_0 [0]
weights: 768000 params, mean=0.0012, std=0.0534, L2=45.2
output: mean=0.0001, std=0.9832, range=[-2.34, 2.45]
transformer_block_1 [1]
weights: 768000 params, mean=0.0008, std=0.0521, L2=44.8
Anomaly Detection
The trace command automatically detects numerical issues:
⚠ 2 anomalies detected:
- transformer_block_2: 5/1024 NaN values
- transformer_block_3: large values (max_abs=156.7)
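The exact thresholds apr applies are not documented here, but checks of this kind can be reproduced by hand. The NumPy sketch below is an assumed approximation; the max_abs cutoff of 100 is illustrative, not apr's actual value.

```python
# Approximate re-creation of the trace anomaly checks on a raw weight
# array: NaN counting and large-magnitude detection. The 100.0 cutoff
# is an assumption for illustration only.
import numpy as np

def find_anomalies(name: str, values: np.ndarray, max_abs_threshold: float = 100.0):
    anomalies = []
    nan_count = int(np.isnan(values).sum())
    if nan_count:
        anomalies.append(f"{name}: {nan_count}/{values.size} NaN values")
    finite = values[np.isfinite(values)]
    if finite.size and np.abs(finite).max() > max_abs_threshold:
        anomalies.append(f"{name}: large values (max_abs={np.abs(finite).max():.1f})")
    return anomalies

print(find_anomalies("transformer_block_2", np.array([0.1, np.nan, 156.7])))
```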
Probar Command
Export layer-by-layer data for visual regression testing with the probar framework.
# Basic export (JSON + PNG)
apr probar model.apr -o ./probar-export
# JSON only
apr probar model.apr -o ./probar-export --format json
# PNG histograms only
apr probar model.apr -o ./probar-export --format png
# Compare with golden reference
apr probar model.apr -o ./probar-export --golden ./golden-ref
# Filter specific layers
apr probar model.apr -o ./probar-export --layer encoder
Example Output
=== Probar Export Complete ===
Source: model.apr
Output: ./probar-export
Format: APR v1.0
Layers: 4
Golden reference comparison generated
Generated files:
- ./probar-export/manifest.json
- ./probar-export/layer_000_block_0.pgm
- ./probar-export/layer_000_block_0.meta.json
- ./probar-export/layer_001_block_1.pgm
- ./probar-export/layer_001_block_1.meta.json
Integration with probar:
1. Copy output to probar test fixtures
2. Use VisualRegressionTester to compare snapshots
3. Run: probar test --visual-diff
Manifest Format
The generated manifest.json contains:
{
"source_model": "model.apr",
"timestamp": "2025-01-15T12:00:00Z",
"format": "APR v1.0",
"layers": [
{
"name": "block_0",
"index": 0,
"histogram": [100, 100, ...],
"mean": 0.0,
"std": 1.0,
"min": -3.0,
"max": 3.0
}
],
"golden_reference": null
}
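For a custom regression gate outside probar, the manifest can also be diffed directly. The sketch below compares per-layer statistics between a fresh export and a golden manifest; the field names follow the manifest shown above, and the 1e-3 tolerance is an assumption.

```python
# Sketch: compare a probar manifest against a golden reference using the
# per-layer mean/std/min/max fields documented above. Tolerance is illustrative.
import json

def compare_manifests(current_path: str, golden_path: str, tol: float = 1e-3) -> bool:
    with open(current_path) as f:
        current = json.load(f)
    with open(golden_path) as f:
        golden = json.load(f)
    golden_layers = {layer["name"]: layer for layer in golden["layers"]}
    ok = True
    for layer in current["layers"]:
        ref = golden_layers.get(layer["name"])
        if ref is None:
            print(f"MISSING in golden: {layer['name']}")
            ok = False
            continue
        for field in ("mean", "std", "min", "max"):
            if abs(layer[field] - ref[field]) > tol:
                print(f"DRIFT {layer['name']}.{field}: {ref[field]} -> {layer[field]}")
                ok = False
    return ok

# compare_manifests("./probar-export/manifest.json", "./golden-ref/manifest.json")
```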
Import Command
Import models from HuggingFace, local files, or URLs into APR format.
# Import from HuggingFace
apr import hf://openai/whisper-tiny -o whisper.apr
# Import with specific architecture
apr import hf://meta-llama/Llama-2-7b -o llama.apr --arch llama
# Import from local safetensors file
apr import ./model.safetensors -o converted.apr
# Import with quantization
apr import hf://org/repo -o model.apr --quantize int8
# Force import (skip validation)
apr import ./model.bin -o model.apr --force
Supported Sources
| Source Type | Format | Example |
|---|---|---|
| HuggingFace | hf://org/repo | hf://openai/whisper-tiny |
| Local File | Path | ./model.safetensors |
| URL | HTTP(S) | https://example.com/model.bin |
Architectures
| Architecture | Flag | Auto-Detection |
|---|---|---|
| Whisper | --arch whisper | ✓ |
| LLaMA | --arch llama | ✓ |
| BERT | --arch bert | ✓ |
| Auto | --arch auto (default) | ✓ |
Quantization Options
| Option | Description |
|---|---|
| `--quantize int8` | 8-bit integer quantization |
| `--quantize int4` | 4-bit integer quantization |
| `--quantize fp16` | 16-bit floating point |
Example Output
=== APR Import Pipeline ===
Source: hf:// (HuggingFace)
Organization: openai
Repository: whisper-tiny
Output: whisper.apr
Architecture: Whisper
Validation: Strict
Importing...
=== Validation Report ===
Score: 98/100 (Grade: A+)
✓ Import successful
Explain Command
Get explanations for error codes, tensor names, and model architectures.
# Explain an error code
apr explain E002
# Explain a specific tensor
apr explain --tensor encoder.conv1.weight
# Explain model architecture
apr explain --file model.apr
Error Code Explanations
apr explain E002
Explain error code: E002
**E002: Corrupted Data**
The payload checksum does not match the header.
- **Common Causes**: Interrupted download, bit rot, disk error.
- **Troubleshooting**:
1. Run `apr validate --checksum` to verify.
2. Check source file integrity (MD5/SHA256).
Tensor Explanations
apr explain --tensor encoder.conv1.weight
**encoder.conv1.weight**
- **Role**: Initial feature extraction (Audio -> Latent)
- **Shape**: [384, 80, 3] (Filters, Input Channels, Kernel Size)
- **Stats**: Mean 0.002, Std 0.04 (Healthy)
Architecture Explanations
apr explain --file whisper.apr
Explain model architecture: whisper.apr
This is a **Whisper (Tiny)** model.
- **Purpose**: Automatic Speech Recognition (ASR)
- **Architecture**: Encoder-Decoder Transformer
- **Input**: 80-channel Mel spectrograms
- **Output**: Text tokens (multilingual)
Pull Command
Download and cache models from HuggingFace with Ollama-style UX.
# Download model to local cache
apr pull hf://Qwen/Qwen2.5-Coder-1.5B-Instruct-GGUF
# Download to specific directory
apr pull hf://openai/whisper-tiny -o ./models/
# Download specific file from repo
apr pull hf://TheBloke/Llama-2-7B-GGUF --file llama-2-7b.Q4_K_M.gguf
Example Output
Downloading: Qwen2.5-Coder-1.5B-Instruct-Q4_K_M.gguf
Progress: [████████████████████] 100% (1.2 GB)
Cached to: ~/.cache/apr/models/qwen2.5-coder-1.5b-q4_k_m.gguf
List Command
List all cached models.
# List cached models
apr list
# List with sizes
apr list --size
# JSON output
apr list --json
Example Output
Cached Models:
qwen2.5-coder-1.5b-q4_k_m.gguf 1.2 GB 2025-01-20
whisper-tiny.apr 39 MB 2025-01-18
llama-2-7b.Q4_K_M.gguf 3.8 GB 2025-01-15
Total: 3 models, 5.04 GB
Rm Command
Remove models from cache.
# Remove specific model
apr rm qwen2.5-coder-1.5b-q4_k_m.gguf
# Remove all cached models
apr rm --all
# Dry run (show what would be deleted)
apr rm --all --dry-run
Cbtop Command
Interactive ComputeBrick pipeline monitor (similar to htop for GPU/CPU inference).
# Start monitor
apr cbtop
# Monitor specific model
apr cbtop --model model.gguf
# Set refresh rate
apr cbtop --refresh 500 # 500ms
Example Output
┌─ ComputeBrick Pipeline Monitor ─────────────────────────┐
│ Model: qwen2.5-coder-1.5b-q4_k_m.gguf │
│ Backend: GPU (CUDA) │
├──────────────────────────────────────────────────────────┤
│ Throughput: 125.3 tok/s │
│ Latency: 8.0 ms/tok │
│ Memory: 1.2 GB / 8.0 GB │
│ Utilization: ████████████░░░░░░░░ 60% │
├──────────────────────────────────────────────────────────┤
│ Layer Timing: │
│ attention: 4.2 ms (52%) │
│ ffn: 2.8 ms (35%) │
│ other: 1.0 ms (13%) │
└──────────────────────────────────────────────────────────┘
Compare-hf Command
Compare APR model against HuggingFace source for validation.
# Compare converted model against HF source
apr compare-hf model.apr --hf-repo openai/whisper-tiny
# Show tensor-level differences
apr compare-hf model.apr --hf-repo org/repo --tensors
# Tolerance for floating point comparison
apr compare-hf model.apr --hf-repo org/repo --tolerance 1e-5
Example Output
Comparing model.apr against hf://openai/whisper-tiny
Tensor Comparison:
✓ encoder.conv1.weight: max_diff=1.2e-7 (within tolerance)
✓ encoder.conv1.bias: max_diff=0.0 (exact match)
✓ decoder.embed_tokens.weight: max_diff=2.3e-8 (within tolerance)
Result: MATCH (all tensors within tolerance 1e-5)
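As the output suggests, the per-tensor check is a maximum absolute difference compared against --tolerance. The NumPy sketch below reproduces that check for a single tensor pair, assuming both tensors are already loaded as float arrays.

```python
# Sketch of the per-tensor comparison compare-hf reports: the maximum
# absolute elementwise difference must stay within the tolerance.
import numpy as np

def within_tolerance(apr_tensor: np.ndarray, hf_tensor: np.ndarray, tol: float = 1e-5):
    max_diff = float(np.max(np.abs(apr_tensor - hf_tensor)))
    return max_diff <= tol, max_diff

ok, max_diff = within_tolerance(np.ones((4, 4)), np.ones((4, 4)) + 1e-7)
print(f"max_diff={max_diff:.1e} ({'within tolerance' if ok else 'FAIL'})")
```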
Hex Command
Hex dump tensor data for low-level debugging.
# Hex dump first 256 bytes
apr hex model.apr --limit 256
# Hex dump specific tensor
apr hex model.apr --tensor encoder.conv1.weight --limit 128
# Show ASCII alongside hex
apr hex model.apr --ascii
Example Output
=== Hex Dump: model.apr ===
00000000: 4150 524e 0100 0000 0200 0000 4c69 6e65 APRN........Line
00000010: 6172 5265 6772 6573 7369 6f6e 0000 0000 arRegression....
00000020: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000030: 0a00 0000 0000 0000 0000 0000 0000 0000 ................
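The dump makes the header readable by eye: the file opens with the ASCII magic APRN, followed by version fields and the model type string. A quick magic-byte sanity check (the only header field this sketch assumes) looks like:

```python
# Minimal sketch: verify the APRN magic visible at offset 0 in the hex dump.
# Interpreting the remaining header fields is left to `apr inspect`.
with open("model.apr", "rb") as f:
    magic = f.read(4)
print("valid APR magic" if magic == b"APRN" else f"unexpected magic: {magic!r}")
```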
Tree Command
Display model architecture as a tree view.
# Show architecture tree
apr tree model.gguf
# Show with tensor shapes
apr tree model.gguf --shapes
# Show with parameter counts
apr tree model.gguf --params
Example Output
model.gguf (1.5B parameters)
├── token_embd [51865, 384]
├── encoder
│ ├── conv1 [384, 80, 3]
│ ├── conv2 [384, 384, 3]
│ └── blocks (4 layers)
│ ├── block.0
│ │ ├── attn [384, 384] × 4
│ │ └── mlp [384, 1536, 384]
│ └── ...
├── decoder
│ ├── embed_tokens [51865, 384]
│ └── blocks (4 layers)
└── lm_head [51865, 384]
Flow Command
Visualize data flow through the model.
# Show data flow diagram
apr flow model.apr
# Export as DOT format
apr flow model.apr --format dot -o model.dot
# Show with tensor shapes
apr flow model.apr --shapes
Example Output
=== Data Flow: model.apr ===
input [batch, seq_len]
│
▼
token_embd [batch, seq_len, 384]
│
▼
encoder.blocks.0 ─┬─ attn ─┬─ Q ──┐
│ ├─ K ──┼─► attention
│ └─ V ──┘
└─ mlp ─────────►
│
▼
...
│
▼
lm_head [batch, seq_len, vocab]
│
▼
output logits
Bench Command
Benchmark model throughput (spec H12: >= 10 tok/s).
# Run benchmark
apr bench model.gguf
# Specify iterations
apr bench model.gguf --iterations 100
# Benchmark with specific prompt
apr bench model.gguf --prompt "Hello, world!"
# JSON output for CI
apr bench model.gguf --json
Example Output
=== Benchmark: model.gguf ===
Configuration:
Iterations: 50
Warmup: 5
Prompt: "Hello, how are you?"
Results:
Throughput: 125.3 tok/s
Latency (p50): 8.0 ms
Latency (p99): 12.3 ms
Memory Peak: 1.2 GB
Spec H12 (>= 10 tok/s): ✓ PASS
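The --json output is intended for CI gating. The exact JSON schema is not shown here, so the field name in the sketch below (throughput_tok_s) is a hypothetical placeholder; the point is the pattern of asserting spec H12 in a pipeline step.

```python
# Hypothetical CI gate around `apr bench model.gguf --json`. The
# "throughput_tok_s" key is a placeholder; adjust it to the actual schema.
import json
import subprocess
import sys

result = subprocess.run(["apr", "bench", "model.gguf", "--json"],
                        capture_output=True, text=True, check=True)
report = json.loads(result.stdout)
throughput = report["throughput_tok_s"]  # placeholder field name

if throughput < 10.0:  # spec H12: >= 10 tok/s
    sys.exit(f"H12 FAIL: {throughput:.1f} tok/s < 10 tok/s")
print(f"H12 PASS: {throughput:.1f} tok/s")
```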
Eval Command
Evaluate model perplexity (spec H13: PPL <= 20).
# Evaluate perplexity
apr eval model.gguf
# Evaluate on specific dataset
apr eval model.gguf --dataset wikitext-2
# Limit context length
apr eval model.gguf --context 512
# JSON output
apr eval model.gguf --json
Example Output
=== Evaluation: model.gguf ===
Dataset: wikitext-2
Tokens: 10000
Context: 2048
Results:
Perplexity: 8.45
Bits per byte: 2.31
Cross-entropy: 2.13
Spec H13 (PPL <= 20): ✓ PASS
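Perplexity is the exponential of the mean per-token cross-entropy (in nats), which is why the two numbers in the report move together:

$$
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right) = e^{H}, \qquad e^{2.13} \approx 8.4
$$

so the reported perplexity of 8.45 is consistent with the cross-entropy of 2.13 nats.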
Profile Command
Deep profiling with Roofline analysis.
# Run profiler
apr profile model.gguf
# Profile specific layers
apr profile model.gguf --layer attention
# Generate roofline plot data
apr profile model.gguf --roofline
# Output as JSON
apr profile model.gguf --json
Example Output
=== Profile: model.gguf ===
Roofline Analysis:
Peak Compute: 2.5 TFLOPS
Peak Memory BW: 200 GB/s
Arithmetic Intensity: 12.5 FLOPS/byte
Layer Breakdown:
Layer Time (ms) Memory Compute Bound
─────────────────────────────────────────────────────────
token_embd 0.5 128 MB 0.1 TF Memory
attention 4.2 256 MB 0.8 TF Compute
ffn 2.8 512 MB 1.2 TF Compute
lm_head 0.8 384 MB 0.4 TF Memory
Bottleneck: Attention layer (compute-bound)
Recommendation: Increase batch size for better GPU utilization
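In a roofline model, a layer is compute-bound when its arithmetic intensity exceeds the machine balance (peak compute divided by peak memory bandwidth). For the hardware numbers above, that balance point is

$$
\frac{2.5\ \text{TFLOPS}}{200\ \text{GB/s}} = 12.5\ \text{FLOP/byte}
$$

Layers below that ratio (token_embd, lm_head) are bandwidth-limited, while attention and ffn sit above it and are compute-bound, which is why the recommendation targets GPU utilization rather than memory bandwidth.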
QA Command
Falsifiable QA checklist for model releases.
# Run full QA checklist
apr qa model.gguf
# Specify throughput threshold
apr qa model.gguf --assert-tps 100
# Require Ollama speedup
apr qa model.gguf --assert-speedup 2.0
# Skip Ollama comparison
apr qa model.gguf --skip-ollama
# JSON output for CI
apr qa model.gguf --json
Example Output
=== QA Checklist: model.gguf ===
[1/10] Format Validation
✓ Valid GGUF header
✓ All tensors readable
✓ No NaN/Inf values
[2/10] Golden Output Test
✓ Prompt: "Hello" → "Hello! How can I help you today?"
✓ Output matches expected (cosine sim: 0.98)
[3/10] Throughput Test
✓ 125.3 tok/s (threshold: 10 tok/s)
[4/10] Perplexity Test
✓ PPL: 8.45 (threshold: 20.0)
[5/10] Ollama Parity
✓ 2.93x Ollama throughput
...
Result: 10/10 PASS
Showcase Command
Qwen2.5-Coder showcase demo for performance demonstration.
# Run showcase demo
apr showcase model.gguf
# Specify warmup and iterations
apr showcase model.gguf --warmup 3 --iterations 10
# GPU mode
apr showcase model.gguf --gpu
# Batched GPU mode
apr showcase model.gguf --gpu --batch
Example Output
╔════════════════════════════════════════════════════════════╗
║ APR Showcase: Qwen2.5-Coder Performance ║
╚════════════════════════════════════════════════════════════╝
Model: qwen2.5-coder-1.5b-q4_k_m.gguf
Backend: GPU (CUDA)
Mode: Batched (M=16)
Benchmark Results:
┌────────────────┬────────────┬───────────┐
│ Metric │ Value │ vs Ollama │
├────────────────┼────────────┼───────────┤
│ Throughput │ 851.8 t/s │ 2.93x │
│ Time to First │ 45 ms │ 0.8x │
│ Memory │ 1.9 GB │ 1.2x │
└────────────────┴────────────┴───────────┘
✓ Showcase PASSED: 2.93x Ollama performance achieved
Check Command
Model self-test: 10-stage pipeline integrity check (APR-TRACE-001).
# Run full check
apr check model.gguf
# Verbose output
apr check model.gguf --verbose
# JSON output
apr check model.gguf --json
Example Output
=== Model Self-Test: model.gguf ===
Stage 1: Format Validation
✓ GGUF magic bytes valid
✓ Version: 3
✓ Tensor count: 145
Stage 2: Tensor Integrity
✓ All tensors readable
✓ Shapes consistent
✓ No NaN/Inf values
Stage 3: Tokenizer Check
✓ Vocabulary size: 151936
✓ Special tokens present
✓ BPE merges valid
Stage 4: Embedding Test
✓ Token embedding produces valid vectors
✓ L2 norm in expected range
Stage 5: Attention Test
✓ Self-attention computes correctly
✓ KV cache initialized
Stage 6: FFN Test
✓ Feed-forward produces valid output
✓ Activation function working
Stage 7: Layer Norm Test
✓ RMSNorm produces normalized output
✓ Epsilon handling correct
Stage 8: LM Head Test
✓ Logits in valid range
✓ Vocabulary mapping correct
Stage 9: Generation Test
✓ Can generate 10 tokens
✓ Output is coherent text
Stage 10: Performance Test
✓ Throughput: 125 tok/s (> 10 tok/s)
Result: 10/10 PASS
Publish Command
Publish model to HuggingFace Hub (APR-PUB-001).
# Publish model directory
apr publish ./model-dir/ org/model-name
# Dry run (show what would be uploaded)
apr publish ./model-dir/ org/model-name --dry-run
# Specify license and tags
apr publish ./model-dir/ org/model-name --license mit --tags rust,ml
# Custom commit message
apr publish ./model-dir/ org/model-name --message "v1.0.0 release"
Example Output
=== Publishing to HuggingFace Hub ===
Repository: org/model-name
Files to upload:
- model.gguf (1.2 GB)
- config.json (2 KB)
- tokenizer.json (500 KB)
Generating README.md with model card...
Uploading...
[████████████████████] 100% model.gguf
[████████████████████] 100% config.json
[████████████████████] 100% tokenizer.json
[████████████████████] 100% README.md
✓ Published to https://huggingface.co/org/model-name
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | General error |
| 3 | File not found / Not a file |
| 4 | Invalid APR format |
| 5 | Validation failed |
| 7 | I/O error |
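Wrapper scripts can branch on these codes directly. The Python sketch below treats format and validation failures differently from infrastructure errors, following the mapping in the table above.

```python
# Sketch: dispatch on apr's documented exit codes from a wrapper script.
import subprocess

result = subprocess.run(["apr", "validate", "models/production.apr", "--strict"])
if result.returncode == 0:
    print("model OK")
elif result.returncode in (4, 5):
    print("model problem: invalid APR format or failed validation")
elif result.returncode in (3, 7):
    print("infrastructure problem: missing file or I/O error")
else:
    print(f"general error (exit code {result.returncode})")
```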
Integration with CI/CD
Use apr validate --strict in CI pipelines to ensure model quality:
# GitHub Actions example
- name: Validate Model
run: apr validate models/production.apr --quality --strict
Toyota Way Principles in apr-cli
- Genchi Genbutsu (Go and See): `apr inspect` lets you see the actual model data, not abstractions
- Genchi Genbutsu (Go to the Source): `apr tensors` reveals the actual tensor structure and statistics
- Jidoka (Built-in Quality): `apr validate` stops on quality issues with clear feedback
- Visualization: `apr debug --drama` makes problems visible and understandable
- Kaizen (Continuous Improvement): `apr diff` enables comparing models for improvement
- Visualization: `apr trace` makes layer-by-layer behavior visible with anomaly detection
- Standardization: `apr probar` creates repeatable visual regression tests
- Automation: `apr import` automates model conversion with inline validation
- Knowledge Sharing: `apr explain` documents errors, tensors, and architectures