# Case Study: Qwen2.5-Coder QA Playbook Results (2026-01-30)
This chapter documents the qualification testing of Qwen2.5-Coder-1.5B-Instruct using the apr-model-qa-playbook framework, which implements Popperian falsification methodology with Toyota Way quality principles.
## Test Summary
| Metric | Result | Status |
|---|---|---|
| Tool Coverage | 12/12 (100%) | ✅ PASS |
| Conversion Tests | 0/7 (0%) | ❌ BLOCKED |
| MQS Score | N/A | ⚠️ Cannot compute (blocked) |
| Certification | NOT QUALIFIED | Blocked by GH-185 |
| APR Version | 0.2.12 | |
| Last Requalification | 2026-01-30 16:55 UTC | GH-185 still open |
## Tool Coverage Testing (F-TOOL-*)
All 12 APR tools verified and passing:

```text
Tool                   Status   Exit  Duration
------------------------------------------------------------
inspect                ✅ PASS     0    1352ms
validate               ✅ PASS     0     768ms
check                  ✅ PASS     0    2147ms
bench                  ✅ PASS     0     594ms
trace-none             ✅ PASS     0    5250ms
trace-basic            ✅ PASS     0    4434ms
trace-layer            ✅ PASS     0    4707ms
trace-payload          ✅ PASS     0    4559ms
profile                ✅ PASS     0    4110ms
profile-ci             ✅ PASS     0    2654ms
profile-ci-assertion   ✅ PASS     1    2373ms
profile-ci-p99         ✅ PASS     0    2303ms
------------------------------------------------------------
Total: 12 passed, 0 failed
```

Note: `profile-ci-assertion` deliberately exercises the assertion-failure path, so its exit code of 1 is the expected, asserted outcome (see the flag semantics below).
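A single tool can also be spot-checked outside the harness. A minimal sketch, assuming each tool accepts a model path as its positional argument (this chapter only shows that form for `run`, `profile`, and `rosetta convert`):

```bash
# Spot-check one tool from the coverage matrix; the positional model-path
# argument is an assumption extrapolated from the `apr run` and
# `apr profile` invocations shown elsewhere in this chapter.
apr validate /path/to/model.gguf
echo "exit code: $?"   # 0 is the passing result recorded in the table
```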
### New Profile CI Features (F-PROFILE-006/007/008)

The `apr profile` command now supports a CI mode with assertion checking:

```bash
# CI mode with throughput assertion
apr profile model.gguf --ci --assert-throughput 10.0 --warmup 3 --measure 10
```

Output:

```text
CI PROFILE REPORT (PMAT-192)
════════════════════════════════════════════════════════════
Throughput:   12.8 tok/s
Latency p50:  156.51 ms
Latency p99:  156.51 ms

ASSERTIONS
✅ PASS throughput: 12.8 tok/s (expected >= 10.0 tok/s)
```
Available flags:

- `--ci`: Enable assertion checking mode
- `--assert-throughput N`: Fail if throughput < N tok/s (exit code 1)
- `--assert-p99 N`: Fail if p99 latency > N ms
- `--assert-p50 N`: Fail if p50 latency > N ms
- `--warmup N`: Warmup passes before measurement
- `--measure N`: Measurement passes for statistics
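Together these flags let a pipeline fail fast on a performance regression. A minimal CI sketch using only the flags listed above; the thresholds (10.0 tok/s, 200 ms) are illustrative values, not ones mandated by the playbook:

```bash
#!/usr/bin/env bash
set -euo pipefail  # a tripped assertion (exit code 1) aborts the job

# Gate the build on minimum throughput and maximum tail latency.
apr profile model.gguf --ci \
  --assert-throughput 10.0 \
  --assert-p99 200 \
  --warmup 3 \
  --measure 10
echo "profile assertions passed"
```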
## Format Conversion Testing (F-CONV-*)
Status: BLOCKED by GH-185
All 7 conversion tests fail because the converted APR files lack an embedded tokenizer:
| Gate | Conversion | Observed Diff | Required | Status |
|---|---|---|---|---|
| F-CONV-G-A | GGUF → APR | 0.746 | < 1e-6 | ❌ FAIL |
| F-CONV-A-G | APR → GGUF | 0.560 | < 1e-6 | ❌ FAIL |
| F-CONV-G-S | GGUF → SafeTensors | NaN | < 1e-6 | ❌ FAIL |
| F-CONV-S-G | SafeTensors → GGUF | 0.560 | < 1e-6 | ❌ FAIL |
| F-CONV-A-S | APR → SafeTensors | NaN | < 1e-6 | ❌ FAIL |
| F-CONV-S-A | SafeTensors → APR | 0.748 | < 1e-6 | ❌ FAIL |
| F-CONV-RT-001 | Round-trip | NaN | < 1e-6 | ❌ FAIL |
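Each gate can be reproduced by hand with commands already shown in this chapter. A sketch for F-CONV-G-A; the output `diff` here is a coarse stand-in for the playbook's numeric `< 1e-6` tensor-diff gate, whose implementation is not shown:

```bash
# Convert, then compare generations from the source and converted models.
# A faithful conversion should yield identical output for a fixed prompt;
# with GH-185 open, the APR side derails (see Root Cause below).
apr rosetta convert model.gguf model.apr
apr run model.gguf -p "What is 2+2?" --max-tokens 8 --no-gpu > gguf.out
apr run model.apr -p "What is 2+2?" --max-tokens 8 --no-gpu > apr.out
diff gguf.out apr.out && echo "outputs match" || echo "divergence (GH-185)"
```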
### Root Cause: GH-185

```bash
# GGUF inference - CORRECT
apr run model.gguf -p "What is 2+2?" --max-tokens 8 --no-gpu
# Output: "4"

# APR inference - WRONG (missing tokenizer)
apr rosetta convert model.gguf model.apr
apr run model.apr -p "What is 2+2?" --max-tokens 8 --no-gpu
# Error: [PMAT-172] APR file missing embedded tokenizer.
# Output: "1. What is the difference between a"
```
Five-Whys Analysis:

- Why wrong output? → Tokenizer missing from APR file
- Why missing? → Conversion only copies tensor data
- Why only tensors? → GGUF stores the tokenizer in metadata fields
- Why not extracted? → `tokenizer.ggml.*` fields not parsed
- ROOT CAUSE: The converter focuses on weights, not model packaging
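The metadata claim in the third why can be spot-checked with the `inspect` tool from the coverage matrix. A sketch, assuming `inspect` accepts a model path and prints metadata keys in grep-able form (its output format is not documented in this chapter):

```bash
# If the tokenizer really lives in GGUF metadata, keys with the
# tokenizer.ggml prefix named above should appear here.
apr inspect model.gguf | grep 'tokenizer.ggml'
```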
## Upstream Issue Status
| Issue | Title | Severity | Status |
|---|---|---|---|
| #185 | APR missing embedded tokenizer | P0 | ⏳ OPEN |
| #184 | CI exit code on failure | P2 | ✅ CLOSED |
| #183 | GGUF v3 validation messages | P2 | ✅ FIXED |
| #182 | SafeTensors companion files | P1 | ✅ FIXED |
| #181 | Q4_K_M block alignment | P0 | ✅ FIXED |
## Requalification History
| Date | APR Version | Tool Tests | Conversion | Result |
|---|---|---|---|---|
| 2026-01-30 16:55 | 0.2.12 | 12/12 ✅ | 0/7 ❌ | BLOCKED (GH-185) |
| 2026-01-30 (initial) | 0.2.11 | 12/12 ✅ | 0/7 ❌ | BLOCKED (GH-185) |
Next Steps: Requalify after the fix for GH-185 is merged and apr version >= 0.2.13 is released.
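Requalification can be gated on the released version. A sketch, assuming `apr --version` prints a semver string (a common CLI convention, not confirmed in this chapter), reusing the full-playbook invocation from the next section:

```bash
# Rerun the playbook only once apr >= 0.2.13 is installed.
required="0.2.13"
current=$(apr --version | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1)
if [ "$(printf '%s\n' "$required" "$current" | sort -V | head -1)" = "$required" ]; then
  apr-qa run playbooks/models/qwen2.5-coder-1.5b.playbook.yaml \
    --subprocess --model-path /path/to/model.gguf --no-gpu
else
  echo "apr $current < $required; the GH-185 fix is not released yet" >&2
fi
```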
## Running the QA Playbook
### Install the apr-qa CLI

```bash
git clone https://github.com/paiml/apr-model-qa-playbook
cd apr-model-qa-playbook
cargo build --release
```
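After a successful build, cargo places the binary under `target/release/` (standard cargo layout). A quick sanity check, assuming the binary supports `--help` (a clap-style convention, not confirmed here):

```bash
# Invoke the freshly built binary directly, or add target/release/ to PATH.
./target/release/apr-qa --help
```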
### Run Tool Tests

```bash
apr-qa tools /path/to/model.gguf --no-gpu
```
### Run the Full Playbook

```bash
apr-qa run playbooks/models/qwen2.5-coder-1.5b.playbook.yaml \
  --subprocess --model-path /path/to/model.gguf --no-gpu
```
### Generate Reports

```bash
apr-qa report output/evidence.json -o output/ --formats all \
  --model "Qwen/Qwen2.5-Coder-1.5B-Instruct"
```
## References
- apr-model-qa-playbook Specification
- Qwen2.5-Coder Showcase Demo
- Karl Popper, *The Logic of Scientific Discovery* (1934)
- Toyota Production System: Jidoka (autonomation) + Poka-Yoke (mistake-proofing)