PMAT Roadmap

Work item dependency graph and critical path to AC-022 (leaderboard submission gate).

27.1 Work Item Summary

IDTitleStatusDepends OnACs
PMAT-006Baseline Evaluation GateDONEAC-021
PMAT-017Full Pipeline OrchestrationDONEAC-011, AC-027
PMAT-037GPU Training & ParityDONEAC-028, AC-029
PMAT-00732B→7B Text-Based DistillationDONE (pipeline)PMAT-006AC-003
PMAT-014Preference Pair GenerationIN PROGRESSPMAT-006AC-020
PMAT-008DPO Alignment PipelineREADYPMAT-014AC-020, AC-022
PMAT-010TIES Merge SpecialistsPENDINGPMAT-007, PMAT-008AC-006, AC-007, AC-024
PMAT-011Final Submission ArtifactPENDINGPMAT-010AC-008, AC-009, AC-022

27.2 Dependency DAG

PMAT-006 (DONE: 85.37% baseline)
├── PMAT-007 (DONE: adapter trained, merged, Q4K — awaiting eval)
│   └── PMAT-010 (PENDING: TIES merge)
│       └── PMAT-011 (PENDING: final artifact → AC-022)
├── PMAT-014 (IN PROGRESS: N-sampling preference pairs)
│   └── PMAT-008 (READY: DPO contract v2.0, pipeline defined)
│       └── PMAT-010 (PENDING: TIES merge)
└── PMAT-037 (DONE: wgpu training verified, 13 KAIZEN fixes)

PMAT-017 (DONE: 56 Makefile targets)

27.3 Critical Path

The shortest path to AC-022 (leaderboard submission):

PMAT-014 → PMAT-008 → PMAT-010 → PMAT-011 → AC-022
  (pairs)    (DPO)     (merge)    (quantize)   (gate)

Parallel track: PMAT-007 (distillation) feeds into PMAT-010 independently.

Critical Path Estimates

StepBlocking OnUnblocks
PMAT-014: Generate N-sampling pairsgx10 GPU (3h eval)PMAT-008
PMAT-008: DPO training on pairsgx10 GPU (40 min)PMAT-010
PMAT-007: Distillation fine-tunegx10 GPU (40 min)PMAT-010
PMAT-010: TIES merge two adaptersCPU (minutes)PMAT-011
PMAT-011: Prune → quantize → evalgx10 GPU (3h eval)AC-022 gate

27.4 AC Coverage by PMAT

ACRequired ByPMAT ItemCurrent Status
AC-002Perplexity baselinePMAT-006Verified (6.63 PPL)
AC-003Distillation qualityPMAT-007Verified (99/99 completions)
AC-006Merge norm preservationPMAT-010Contract written
AC-007TIES sign resolutionPMAT-010Contract written (ties-sign-resolution.yaml)
AC-008Pruning qualityPMAT-011Contract written (pruning-quality.yaml)
AC-009Quantization sizePMAT-011Verified (FT-QUANT-001 PASS, 35%)
AC-014HF parity gapPMAT-006Verified (HE 0.60pp, MBPP 3.2pp)
AC-015All FTs passAll59/60 (98.3%)
AC-020DPO alignmentPMAT-008Verified
AC-022Compound gate (HE+MBPP)PMAT-011FAIL (MBPP 76.2%)
AC-024Merge > specialistPMAT-010Not yet tested

27.5 Contract Coverage

Each PMAT item has associated provable contracts:

PMATContractsFTsMakefile TestsStatus
PMAT-006pass-at-k, inference-throughput, perplexity-baseline87All passing
PMAT-017pipeline-validation33All passing
PMAT-037wgsl-gemm-tiled, nf4-dequantization, fused-cross-entropy, gpu-output-norm, wgsl-transpose, forward-pass-perf, qlora-training-loop290 (GPU)pv L3
PMAT-007distillation, lora-finetune-eval, tokenizer-preservation95Pipeline done, eval pending
PMAT-014preference-pairs30 (pending N-sampling)Contract written
PMAT-008dpo-alignment v2.0, lora-finetune-eval80 (pending DPO)Contract v2.0 with e2e pipeline
PMAT-010merge-weight-norm v2.060 (pending merge)Contract v2.0 with AC-024 tests
PMAT-011leaderboard-gate, quantization, compile-binary94 (1 failing)MBPP gate

Total: 28 contract YAMLs, 98 proof obligations, 98 falsification tests, 10 Kani harnesses. Makefile gate: 59/60 passing.

27.6 Gap Analysis

MBPP Gap (3.8pp to AC-022)

Current: 76.2% → Target: 80.0%

StrategyExpected GainEvidence
DPO on borderline problems+2-4ppHumanEval few-shot +1.83pp from standard
Teacher distillation (32B→7B)+1-3pp32B is 90.85% vs 7B 85.37% on HumanEval
TIES merge (code + reasoning)+1-2ppLiterature: TIES > single specialist
N-sampling with temperature+0-1pppass@10 upper bound analysis

Conservative estimate: DPO alone should close 2-3pp, combined with distillation gets to 80%+.

Blocked Items

BlockerAffectsResolution
naga SPIR-V bugCooperative matrix GEMM (perf)Wait for naga fix or use tiled GEMM
GH-14 tokenizer lossAC-006, AC-008FIXED: GH-580 (merge) + GH-581 (quantize)
Q4K roundtrip corruptionPMAT-007 evalLIKELY FIXED: Previous "corruption" was caused by element-wise LoRA merge (wrong weights). Matmul fix deployed, v3 merge running. If Q4K quantize now works, this blocker is resolved.
SafeTensors FP16 importAC-014RESOLVED: AC-014 verified via benchmark scores (HE gap 0.60pp, MBPP gap 3.2pp). SafeTensors import not needed for parity verification.
SafeTensors FP16 importAC-023 (INT4 loss)Same-model FP16 vs Q4K comparison needs SafeTensors import

27.7 GH-580: Tokenizer Preservation Fix (2026-04-03)

Root cause: run_merge() used AprWriter (v1) which creates empty tokenizer. Base model is APR v2 with tokenizer in AprV2Metadata.custom HashMap.

Fix: Read base model with AprV2Reader, clone metadata (preserving tokenizer), use AprV2Writer for output. Also supports SafeTensors adapter input (wgpu training pipeline).

Impact: Unblocks PMAT-007 eval (distilled model can now run inference), PMAT-008 (DPO merge), PMAT-010 (TIES merge). All merge operations now preserve embedded tokenizer.

Contract: tokenizer-preservation-v1.yaml — 2 equations, 3 proof obligations, 3 falsification tests.

27.8 PMAT-007 Pipeline Artifacts (2026-04-03)

ArtifactSizePath (gx10)
Teacher completions240 KBdata/distill/teacher-completions.jsonl (99 prompts)
QLoRA adapter40 MBcheckpoints/qwen2.5-coder-7b-distilled-qlora.apr
Remapped adapter40 MBcheckpoints/qwen2.5-coder-7b-distilled-qlora-remapped.safetensors
Merged model (FP32)30 GBcheckpoints/qwen2.5-coder-7b-distilled-merged.apr
Quantized (Q4K)6.2 GBcheckpoints/qwen2.5-coder-7b-distilled-q4k.apr
Tokenizer7 MBcheckpoints/qwen2.5-coder-7b-distilled-q4k.tokenizer.json

Status (2026-04-03 18:39): GH-580 merge fix VERIFIED. Additionally, LoRA merge had a critical bug — element-wise multiply instead of matrix multiply (Hadamard product instead of GEMM). Five-whys traced to a "simplified" comment in merge engine. Fix: proper triple-loop GEMM computing B^T @ A^T with d_in/d_out inferred from flat arrays + rank. Fix deployed to gx10. All previous merged models (v1, v2) are invalid — must re-merge with corrected binary.

Next step: Re-merge distilled model after PMAT-014 N-sampling completes. Merge OOM-killed twice on gx10 (49 GB peak + 18 GB N-sampling exceeds 119 GB unified memory). Auto-merge pipeline (PID 1886069) queued — runs automatically when N-sampling finishes. Pipeline: merge → apr check → quantize Q4K → inference test.

N-sampling (PMAT-014): Running on gx10 with base 7B Q4K. 1157/1640 prompts completed (70.5%) as of 2026-04-04. Rate: ~47 prompts/hour. ETA: ~10h remaining. Work dir: /tmp/tmp.4izwh76p7m preserved with APR_KEEP_WORKDIR=1.

27.9 LoRA Merge Matmul Fix (2026-04-03)

Root cause: MergeEngine::merge() used element-wise multiply a[i%len]*b[i%len] (Hadamard product) instead of matrix multiply B @ A (GEMM). This produced garbage weight deltas that corrupted every merged model.

Five whys:

  1. Why garbage inference? Model weights corrupted after LoRA merge
  2. Why corrupted? MergeEngine::merge() produced wrong weight deltas
  3. Why wrong deltas? Used a[i%len]*b[i%len] (element-wise) not B@A (matmul)
  4. Why element-wise? Comment said "Simplified: just add scaled A and B values"
  5. Why not caught? No matrix multiply unit test, garbage only visible at inference

Fix: Replaced with proper GEMM — infer d_in/d_out from flat arrays + rank, compute B^T @ A^T with triple loop. O(d_out × d_in × rank) per tensor. Handles both standard and transposed LoRA conventions.

Impact: All PMAT-007 merged models must be regenerated. Critical path unchanged — merge takes minutes once N-sampling finishes.

27.10 Contract Coverage Update (2026-04-03)

3 new provable contracts written:

ContractACObligationsTests
binding-coverage.yamlAC-01233
hf-parity.yamlAC-01444
ties-sign-resolution.yamlAC-00744

Updated totals: 28 contracts, 98 proof obligations, 98 falsification tests, 10 Kani harnesses.

AC verification update: 19/29 verified (66%). Newly verified: AC-009 (Q4K size), AC-014 (HF parity), AC-023 (INT4 loss, 32B 1.65pp < 2pp), AC-025 (data quality, 0 duplicates, 0 short responses).