Data Strategy
The model is only as good as the fine-tuning data. Our primary data comes from four ground truth corpora in the paiml ecosystem.
12.0 Ground Truth Corpora (Tier 1)
Extracted via make prep-data → apr data prep (GH-7). These are high-quality,
hand-crafted Python implementations with full type annotations, docstrings,
and test coverage.
| Corpus | Raw Pairs | Description | Source Repo |
|---|---|---|---|
| depyler | ~11,841 | Algorithms, data structures, CLI patterns, TDD examples | ~/src/depyler/ |
| hf-gtc | ~3,535 | HuggingFace production recipes (training, inference, RAG) | ~/src/hf-ground-truth-corpus/ |
| jax-gtc | ~58 | JAX numerical computing (autodiff, transforms, training) | ~/src/jax-ground-truth-corpus/ |
| vllm-gtc | ~81 | vLLM inference optimization (KV cache, sampling, serving) | ~/src/vllm-ground-truth-corpus/ |
| Total | ~15,494 |
Extraction method: AST parsing extracts function/class definitions with docstrings. Instruction = signature + docstring reformulated as natural language. Response = full source code. Filtered by response length (3–200 lines).
12.0.1 Supplemental Datasets (Tier 2)
| Dataset | Size | Purpose | Source | Format |
|---|---|---|---|---|
| Code Reasoning | 20K | Chain-of-thought for complex problems | Synthetic from teacher model | JSONL (problem, reasoning, code) |
| Code Tests | 10K | Test-driven examples (input→test→code) | HumanEval/MBPP-style | JSONL (prompt, tests, solution) |
| Multilingual Code | 30K | Python/Rust/TS/Go/Java coverage | MultiPL-E format | JSONL (language, prompt, solution) |
| Calibration | 128 | Wanda/SparseGPT calibration | Random code samples | JSONL (text) |
12.1 Decontamination Protocol
Training data MUST NOT overlap with evaluation benchmarks. This is critical for leaderboard integrity.
n-gram decontamination: Remove any training sample whose 10-gram overlap with any HumanEval/MBPP/BigCodeBench problem exceeds 50%. This is a hard gate — no exceptions.
# GATE: Decontamination check before training
apr data decontaminate training.jsonl \
--reference humaneval.jsonl mbpp.jsonl bigcodebench.jsonl \
--ngram 10 --threshold 0.50 --json
# Or via Makefile:
make decontaminate DATA=data/instruct-corpus.jsonl
Implementation: alimentar::quality::decontaminate (alimentar#30)
wired into apr data decontaminate (aprender#415). Enforces AC-016
gate: fails if contamination rate >= 1%.
Time-based decontamination for LiveCodeBench: Any problem published within 90 days of training data generation is excluded. LiveCodeBench's rolling nature makes this mandatory.
12.2 Data Preparation Pipeline
# GATE: Validate teacher produces correct code BEFORE generating training data
apr eval teacher.apr --task classify --data humaneval.jsonl --json > teacher-baseline.json
# Verify teacher pass@1 meets minimum threshold (e.g., >60%) before proceeding
# Generate synthetic training data from validated teacher
apr chat teacher.apr --system "Generate code instruction pairs" \
--batch instructions.txt --json > code-instruct-raw.jsonl
# Format validation
apr validate --data code-instruct-raw.jsonl --format jsonl
# Quality scoring (alimentar)
alimentar quality code-instruct-raw.jsonl --min-score 80 -o code-instruct-clean.jsonl
# Decontamination gate
apr data decontaminate code-instruct-clean.jsonl \
--reference humaneval.jsonl mbpp.jsonl --ngram 10 --threshold 0.50
Bootstrapping discipline: Never generate training data from a teacher whose inference quality hasn't been verified. The pipeline is: import → eval teacher → generate data → validate data → decontaminate → train student.
12.3 Preference Pair Generation (PMAT-014)
DPO alignment requires preference pairs: (prompt, chosen, rejected) triples where "chosen" is a correct completion and "rejected" is an incorrect one. We generate these from N-sampling eval results.
# Step 1: Run N-sampling eval (generates N completions per problem)
make eval-humaneval CHECKPOINT=checkpoints/model.apr NUM_SAMPLES=10 TEMPERATURE=0.8
# Step 2: Generate preference pairs from eval results
make generate-preference-pairs EVAL_WORK_DIR=/tmp/eval-work-dir
# Output: data/preference-pairs.jsonl
# Step 3: Use for DPO training
apr finetune checkpoint.apr --method dpo --data data/preference-pairs.jsonl
Pair generation strategy: For each problem with at least 1 passing and 1 failing sample, create all (passing, failing) pairs. A problem with 3 passing and 7 failing samples produces 21 preference pairs. This maximizes training signal from each eval run.
Expected yield from 164 HumanEval problems at 85% pass@1 (N=10, T=0.8):
- ~140 problems with at least 1 pass → usable for pairs
- ~120 problems with mixed pass/fail → source of pairs
- ~500-1000 preference pairs per eval run
Implementation: scripts/generate-preference-pairs.sh reads the eval work directory, re-tests each sample to classify pass/fail, and outputs JSONL.