Qwen Inference — LLM Inference with realizar

Aprender provides LLM inference through the realizar crate, accessible via the apr CLI or the Rust API. The aprender crate handles model format conversion and training; all inference goes through realizar for optimal throughput (225+ tok/s on GPU, 30+ tok/s on CPU, for a 7B model at Q4K quantization).

Quick Start (CLI)

# Run inference via apr CLI (recommended)
apr run model.safetensors --prompt "What is 2+2?" --max-tokens 32

# Chat mode with interactive conversation
apr chat model.gguf

# Serve as HTTP API
apr serve model.apr --port 8080

Examples

Qwen Chat Demo

Demonstrates Qwen2 model configuration and tokenization setup:

cargo run --example qwen_chat

Qwen APR Native Format

Creates and loads a Qwen2-0.5B model in native APR v2 format:

cargo run --example qwen_apr_native

Production Workflow

# Import from HuggingFace
apr import hf://Qwen/Qwen2-0.5B-Instruct -o qwen2-0.5b.apr

# Quantize for deployment
apr convert qwen2-0.5b.apr --quantize q4k -o qwen2-0.5b-q4k.apr

# Validate quality
apr qa qwen2-0.5b-q4k.apr

# Run inference
apr run qwen2-0.5b-q4k.apr --prompt "Hello!" --max-tokens 64

Supported Model Formats

Format                 | CPU | GPU                      | Notes
GGUF (Q4K, Q6K)        | Yes | Yes                      | Best throughput, quantized
APR (native)           | Yes | Yes                      | Embedded tokenizer, portable
SafeTensors (F32, F16) | Yes | Yes (if VRAM sufficient) | Large, full precision

Qwen3.5 (Hybrid Attention)

Qwen3.5-9B-Instruct introduces hybrid attention — alternating standard softmax and Gated Delta Net linear attention layers. Key differences from Qwen2:

  • head_dim=256 (explicit, vs Qwen2's computed 128)
  • No attention bias (has_bias=false)
  • Hybrid layer_types — some layers are "linear", using O(n) recurrence
  • vocab_size=248320 (vs 152064 for Qwen2)

# Import Qwen3.5 (hybrid layers auto-detected from config.json)
apr import hf://Qwen/Qwen3.5-9B-Instruct -o qwen35.apr --arch qwen3_5

# Verify hybrid config
apr inspect qwen35.apr

The realizar inference engine automatically dispatches to the correct attention kernel per layer based on the layer_types config field. See the Qwen3.5 Hybrid Attention chapter for details.
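To make the "linear" layers concrete: instead of softmax over all past tokens, a Gated Delta Net layer maintains a fixed-size state matrix and updates it once per token, giving O(n) total cost. The following is a minimal illustrative sketch of a gated delta-rule recurrence step — the shapes, gate values, and function names are assumptions for illustration, not realizar's actual kernel:

```rust
// Sketch of one gated delta-rule linear-attention step (illustrative only).
// State S is a d_v x d_k matrix updated per token; no KV cache growth.

fn outer(a: &[f64], b: &[f64]) -> Vec<Vec<f64>> {
    a.iter().map(|&x| b.iter().map(|&y| x * y).collect()).collect()
}

fn matvec(m: &[Vec<f64>], v: &[f64]) -> Vec<f64> {
    m.iter()
        .map(|row| row.iter().zip(v).map(|(a, b)| a * b).sum())
        .collect()
}

/// One recurrent step: S <- alpha * S + beta * (v - S k) k^T, output o = S q.
fn delta_step(
    s: &mut [Vec<f64>], // running state, d_v x d_k
    q: &[f64],
    k: &[f64],
    v: &[f64],
    alpha: f64, // decay gate in (0, 1]
    beta: f64,  // write strength
) -> Vec<f64> {
    let sk = matvec(s, k); // state's current prediction for key k
    let err: Vec<f64> = v.iter().zip(&sk).map(|(vi, p)| vi - p).collect();
    let upd = outer(&err, k); // rank-1 delta-rule correction
    for (srow, urow) in s.iter_mut().zip(&upd) {
        for (se, ue) in srow.iter_mut().zip(urow) {
            *se = alpha * *se + beta * ue;
        }
    }
    matvec(s, q) // per-token cost is O(d_k * d_v), independent of sequence length
}

fn main() {
    let d = 4;
    let mut state = vec![vec![0.0; d]; d];
    let q = vec![0.0, 1.0, 0.0, 0.0];
    let k = vec![0.0, 1.0, 0.0, 0.0];
    let v = vec![0.5, 0.5, 0.0, 0.0];
    let o = delta_step(&mut state, &q, &k, &v, 0.9, 1.0);
    println!("{:?}", o); // [0.5, 0.5, 0.0, 0.0]
}
```

Because the state has a fixed size, memory stays constant during decoding; the trade-off is that recall over long contexts is approximate, which is why hybrid models interleave these layers with full softmax attention.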

Fine-Tuning

Both Qwen2.5 and Qwen3.5 models support classification fine-tuning via LoRA:

# Qwen3.5-9B: 6.8M trainable params (rank-16 LoRA on Q/V projections)
apr finetune --task classify --model-size 9B --plan
apr finetune --task classify --model-size 9B --data train.jsonl -o checkpoints/

# Qwen2.5-0.5B: 1.1M trainable params (smaller, good for testing)
apr finetune --task classify --model-size 0.5B --data train.jsonl -o checkpoints/

Key Qwen3.5 fine-tuning differences:

  • No attention bias — LoRA adapters target weight matrices only
  • 64 LoRA adapters — 32 layers x 2 targets (Q + V projections)
  • head_dim=256 — larger attention projections than Qwen2
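The adapter counts above follow from the standard LoRA parameter formula: each adapted weight matrix gets two low-rank factors, contributing r * (d_in + d_out) trainable parameters. A quick sketch of the arithmetic — the projection dimensions used here are hypothetical placeholders, not Qwen3.5's real shapes:

```rust
// Sketch: LoRA trainable-parameter accounting.
// A rank-r adapter on a d_in x d_out matrix adds A (d_in x r) and
// B (r x d_out), i.e. r * (d_in + d_out) trainable parameters.

fn lora_params(rank: usize, d_in: usize, d_out: usize) -> usize {
    rank * (d_in + d_out)
}

fn main() {
    let rank = 16; // rank-16 LoRA, as in the plan above
    let layers = 32;
    let targets = 2; // Q and V projections per layer -> 64 adapters
    // Hypothetical 4096 x 4096 projection, for illustration only:
    let per_adapter = lora_params(rank, 4096, 4096);
    let total = layers * targets * per_adapter;
    println!(
        "{} adapters, {} params each, {} total",
        layers * targets,
        per_adapter,
        total
    );
}
```

Plugging in the model's real projection shapes (which `apr finetune --plan` reports) yields the trainable totals quoted above.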

See the LoRA Fine-Tuning chapter for theory and details.

See Also