# Showcase Benchmark

This example demonstrates the Qwen2.5-Coder showcase benchmark harness for measuring inference performance against baselines like Ollama and llama.cpp.

## 🏆 SHOWCASE COMPLETE (2026-01-18)

CORRECTNESS-012 fixed! Both the GGUF and APR formats exceed 2x Ollama's decode throughput on GPU.

### Qwen2.5-Coder-1.5B Results

| Format | M=8 | M=16 | M=32 | Status |
|--------|-----|------|------|--------|
| GGUF   | 770.0 tok/s (2.65x) | 851.8 tok/s (2.93x) | 812.8 tok/s (2.79x) | ✅ PASS |
| Target | 582 tok/s (2x) | 582 tok/s (2x) | 582 tok/s (2x) | - |

### Key Achievements

- **GGUF GPU**: 851.8 tok/s = 2.93x Ollama (291 tok/s baseline; the speedup arithmetic is sketched below)
- **CPU/GPU Parity**: Verified; outputs match exactly
- **APR Format**: Quantization (Q4_K, Q6_K) preserved through GGUF → APR conversion
- **File Size**: 1.9GB APR file with full model fidelity
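The pass/fail gate behind these figures is plain arithmetic: a run passes when its decode throughput reaches at least 2x the 291 tok/s Ollama baseline (582 tok/s). A minimal sketch using only the numbers quoted above; `speedup` and `passes` are illustrative helpers, not part of the harness API:

```rust
/// Speedup gate used in the results table above. The 291 tok/s Ollama
/// baseline and the 2x target come from this document; the helpers are
/// illustrative only.
const OLLAMA_BASELINE_TOK_S: f64 = 291.0;
const TARGET_MULTIPLIER: f64 = 2.0; // 2x baseline = 582 tok/s

fn speedup(measured_tok_s: f64) -> f64 {
    measured_tok_s / OLLAMA_BASELINE_TOK_S
}

fn passes(measured_tok_s: f64) -> bool {
    speedup(measured_tok_s) >= TARGET_MULTIPLIER
}

fn main() {
    // The three GGUF GPU measurements from the results table (M=8/16/32).
    for measured in [770.0, 851.8, 812.8] {
        println!(
            "{measured:.1} tok/s = {:.2}x Ollama, pass: {}",
            speedup(measured),
            passes(measured)
        );
    }
}
```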

## Run the Showcase

```bash
# APR GPU Benchmark (FEATURED)
MODEL_PATH=/path/to/qwen2.5-coder-1.5b-instruct-q4_k_m.gguf \
  cargo run --example apr_gpu_benchmark --release --features cuda

# Full showcase benchmark suite
cargo run --release --example showcase_benchmark
```

## Overview

The `showcase_benchmark` example implements the following pipeline (see the sketch after this list):

- Automated model downloading from Hugging Face
- Side-by-side benchmarking against Ollama
- Performance visualization
- Regression detection
- GGUF → APR conversion with quantization preservation
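A minimal structural sketch of that pipeline under the assumptions of the list above; every name here (`ShowcaseConfig`, `download_model`, `convert_to_apr`, the benchmark stubs) is hypothetical and does not mirror the real crate API:

```rust
// Hypothetical outline of the pipeline stages listed above. The stubbed
// stages only illustrate data flow; visualization is omitted for brevity.
struct ShowcaseConfig {
    hf_repo: &'static str,  // Hugging Face repo to pull the GGUF model from
    target_multiplier: f64, // e.g. 2.0 for the "2x Ollama" gate
}

fn download_model(repo: &str) -> String {
    format!("models/{}.gguf", repo.replace('/', "_")) // stub: download + cache
}

fn convert_to_apr(gguf_path: &str) -> String {
    gguf_path.replace(".gguf", ".apr") // stub: GGUF -> APR, Q4_K/Q6_K preserved
}

fn bench_decode(_model_path: &str) -> f64 {
    851.8 // stub: would run the harness and return measured decode tok/s
}

fn bench_ollama(_repo: &str) -> f64 {
    291.0 // stub: would time the Ollama baseline on the same prompt
}

fn run_showcase(cfg: &ShowcaseConfig) -> Result<(), String> {
    let gguf = download_model(cfg.hf_repo);       // 1. automated download
    let apr = convert_to_apr(&gguf);              // 2. quantization-preserving conversion
    let ours = bench_decode(&apr);                // 3. side-by-side benchmark ...
    let baseline = bench_ollama(cfg.hf_repo);     //    ... against Ollama
    let floor = cfg.target_multiplier * baseline; // 4. regression detection
    if ours < floor {
        return Err(format!("regression: {ours:.1} tok/s < {floor:.1} tok/s target"));
    }
    Ok(())
}

fn main() {
    let cfg = ShowcaseConfig {
        hf_repo: "Qwen/Qwen2.5-Coder-1.5B-Instruct-GGUF",
        target_multiplier: 2.0,
    };
    run_showcase(&cfg).expect("showcase gate failed");
}
```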

## Test Matrix

| Model | Size | GPU Target | GPU Achieved | CPU Target |
|-------|------|------------|--------------|------------|
| Qwen2.5-Coder-0.5B | 490MB | 500+ tok/s | TBD | 150+ tok/s |
| Qwen2.5-Coder-1.5B | 1.1GB | 350+ tok/s | 824.7 tok/s | 75+ tok/s |
| Qwen2.5-Coder-7B | 4.4GB | 150+ tok/s | TBD | 25+ tok/s |
| Qwen2.5-Coder-32B | 19GB | 40+ tok/s | TBD | 6+ tok/s |

## Metrics

- **Throughput**: Tokens per second during the decode phase (computed as sketched below)
- **Prefill**: Prompt processing speed (tok/s)
- **TTFT**: Time to first token
- **Memory**: Peak VRAM/RAM usage
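A minimal sketch of how the three timing-based metrics fall out of raw timestamps, assuming a simple prefill-then-decode loop; `run_prefill` and `run_decode_step` are hypothetical stand-ins for the real inference calls, and peak-memory sampling is omitted:

```rust
use std::time::Instant;

// Stand-ins for the real inference calls; each sleeps briefly so the
// example prints plausible numbers when run.
fn run_prefill(_prompt_tokens: usize) {
    std::thread::sleep(std::time::Duration::from_millis(50));
}
fn run_decode_step() {
    std::thread::sleep(std::time::Duration::from_millis(2));
}

fn main() {
    let prompt_tokens = 128;
    let gen_tokens = 256;

    let start = Instant::now();
    run_prefill(prompt_tokens);      // prompt processing
    let prefill = start.elapsed();

    run_decode_step();               // first generated token
    let ttft = start.elapsed();      // TTFT spans prefill + first decode step

    for _ in 1..gen_tokens {
        run_decode_step();
    }
    let decode = start.elapsed() - ttft; // decode phase only

    println!("prefill:    {:.1} tok/s", prompt_tokens as f64 / prefill.as_secs_f64());
    println!("TTFT:       {:.1} ms", ttft.as_secs_f64() * 1e3);
    println!("throughput: {:.1} tok/s", (gen_tokens - 1) as f64 / decode.as_secs_f64());
}
```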

## See Also