Qwen Inference — LLM Inference with realizar

Aprender provides LLM inference through the realizar crate, accessible via the apr CLI or the Rust API. The aprender crate handles model format conversion and training; all inference goes through realizar for optimal throughput (225+ tok/s on GPU, 30+ tok/s on CPU, for a 7B model at Q4K quantization).

Quick Start (CLI)

# Run inference via apr CLI (recommended)
apr run model.safetensors --prompt "What is 2+2?" --max-tokens 32

# Chat mode with interactive conversation
apr chat model.gguf

# Serve as HTTP API
apr serve model.apr --port 8080

Examples

Qwen Chat Demo

Demonstrates Qwen2 model configuration and tokenization setup:

cargo run --example qwen_chat

Qwen APR Native Format

Creates and loads a Qwen2-0.5B model in native APR v2 format:

cargo run --example qwen_apr_native

Production Workflow

# Import from HuggingFace
apr import hf://Qwen/Qwen2-0.5B-Instruct -o qwen2-0.5b.apr

# Quantize for deployment
apr convert qwen2-0.5b.apr --quantize q4k -o qwen2-0.5b-q4k.apr

# Validate quality
apr qa qwen2-0.5b-q4k.apr

# Run inference
apr run qwen2-0.5b-q4k.apr --prompt "Hello!" --max-tokens 64

Supported Model Formats

Format                 | CPU | GPU                      | Notes
GGUF (Q4K, Q6K)        | Yes | Yes                      | Best throughput, quantized
APR (native)           | Yes | Yes                      | Embedded tokenizer, portable
SafeTensors (F32, F16) | Yes | Yes (if VRAM sufficient) | Large, full precision

Qwen3.5 (Hybrid Attention)

Qwen3.5-9B-Instruct introduces hybrid attention — alternating standard softmax and Gated Delta Net linear attention layers. Key differences from Qwen2:

  • head_dim=256 (explicit, vs Qwen2's computed 128)
  • No attention bias (has_bias=false)
  • Hybrid layer_types — some layers are "linear", using O(n) recurrence
  • vocab_size=248320 (vs 152064 for Qwen2)

# Import Qwen3.5 (hybrid layers auto-detected from config.json)
apr import hf://Qwen/Qwen3.5-9B-Instruct -o qwen35.apr --arch qwen3_5

# Verify hybrid config
apr inspect qwen35.apr

The realizar inference engine automatically dispatches to the correct attention kernel per layer based on the layer_types config field. See the Qwen3.5 Hybrid Attention chapter for details.
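To make the "linear" layers concrete: instead of softmax over all past tokens, a Gated Delta Net layer maintains a fixed-size state matrix and updates it once per token, giving O(n) total cost. The following is a minimal illustrative sketch of a gated delta-rule recurrence step — the shapes, gate values, and function names are assumptions for illustration, not realizar's actual kernel:

```rust
// Sketch of one gated delta-rule linear-attention step (illustrative only).
// State S is a d_v x d_k matrix updated per token; no KV cache growth.

fn outer(a: &[f64], b: &[f64]) -> Vec<Vec<f64>> {
    a.iter().map(|&x| b.iter().map(|&y| x * y).collect()).collect()
}

fn matvec(m: &[Vec<f64>], v: &[f64]) -> Vec<f64> {
    m.iter()
        .map(|row| row.iter().zip(v).map(|(a, b)| a * b).sum())
        .collect()
}

/// One recurrent step: S <- alpha * S + beta * (v - S k) k^T, output o = S q.
fn delta_step(
    s: &mut [Vec<f64>], // running state, d_v x d_k
    q: &[f64],
    k: &[f64],
    v: &[f64],
    alpha: f64, // decay gate in (0, 1]
    beta: f64,  // write strength
) -> Vec<f64> {
    let sk = matvec(s, k); // state's current prediction for key k
    let err: Vec<f64> = v.iter().zip(&sk).map(|(vi, p)| vi - p).collect();
    let upd = outer(&err, k); // rank-1 delta-rule correction
    for (srow, urow) in s.iter_mut().zip(&upd) {
        for (se, ue) in srow.iter_mut().zip(urow) {
            *se = alpha * *se + beta * ue;
        }
    }
    matvec(s, q) // per-token cost is O(d_k * d_v), independent of sequence length
}

fn main() {
    let d = 4;
    let mut state = vec![vec![0.0; d]; d];
    let q = vec![0.0, 1.0, 0.0, 0.0];
    let k = vec![0.0, 1.0, 0.0, 0.0];
    let v = vec![0.5, 0.5, 0.0, 0.0];
    let o = delta_step(&mut state, &q, &k, &v, 0.9, 1.0);
    println!("{:?}", o); // [0.5, 0.5, 0.0, 0.0]
}
```

Because the state has a fixed size, memory stays constant during decoding; the trade-off is that recall over long contexts is approximate, which is why hybrid models interleave these layers with full softmax attention.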

Fine-Tuning

Both Qwen2.5 and Qwen3.5 models support classification fine-tuning via LoRA:

# Qwen3.5-9B: 6.8M trainable params (rank-16 LoRA on Q/V projections)
apr finetune --task classify --model-size 9B --plan
apr finetune --task classify --model-size 9B --data train.jsonl -o checkpoints/

# Qwen2.5-0.5B: 1.1M trainable params (smaller, good for testing)
apr finetune --task classify --model-size 0.5B --data train.jsonl -o checkpoints/

Key Qwen3.5 fine-tuning differences:

  • No attention bias — LoRA adapters target weight matrices only
  • 64 LoRA adapters — 32 layers x 2 targets (Q + V projections)
  • head_dim=256 — larger attention projections than Qwen2
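The adapter counts above follow from the standard LoRA parameter formula: each adapted weight matrix gets two low-rank factors, contributing r * (d_in + d_out) trainable parameters. A quick sketch of the arithmetic — the projection dimensions used here are hypothetical placeholders, not Qwen3.5's real shapes:

```rust
// Sketch: LoRA trainable-parameter accounting.
// A rank-r adapter on a d_in x d_out matrix adds A (d_in x r) and
// B (r x d_out), i.e. r * (d_in + d_out) trainable parameters.

fn lora_params(rank: usize, d_in: usize, d_out: usize) -> usize {
    rank * (d_in + d_out)
}

fn main() {
    let rank = 16; // rank-16 LoRA, as in the plan above
    let layers = 32;
    let targets = 2; // Q and V projections per layer -> 64 adapters
    // Hypothetical 4096 x 4096 projection, for illustration only:
    let per_adapter = lora_params(rank, 4096, 4096);
    let total = layers * targets * per_adapter;
    println!(
        "{} adapters, {} params each, {} total",
        layers * targets,
        per_adapter,
        total
    );
}
```

Plugging in the model's real projection shapes (which `apr finetune --plan` reports) yields the trainable totals quoted above.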

See the LoRA Fine-Tuning chapter for theory and details.

See Also