# Appendix G: Data Pipeline

Documents the Phase 1 data ingestion, tokenization, and augmentation pipeline.

## Source Corpora
| Source | Repository | Files | Rows | Parquet Size |
|---|---|---|---|---|
| depyler | depyler examples + TDD book | 1,843 | 1,843 | 6MB |
| hf-ground-truth | HuggingFace ground truth corpus | 11,928 | 11,493 | 197MB |
| jax-ground-truth | JAX ground truth corpus | 2,697 | 2,637 | 50MB |
| vllm-ground-truth | vLLM ground truth corpus | 1,118 | 1,100 | 18MB |
All sources are Python code, collected via `alimentar import local`.
## Training Mix

Weighted sampling with Tier 1 (depyler) upsampled:

```sh
alimentar mix \
  depyler.parquet:0.4 \
  hf.parquet:0.3 \
  jax.parquet:0.15 \
  vllm.parquet:0.15 \
  --output mixed.parquet \
  --seed 42
```
Result: 17,070 rows (depyler upsampled 3.7x from 1,843 to ~6,829).
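The upsampling arithmetic can be checked directly: each source's share of the mixed output is its weight times the total row count, and dividing by the source's own row count gives the upsample factor. A minimal sketch using the numbers from the tables above:

```python
# Mixed-output rows implied by each weight (totals from the tables above)
total_rows = 17_070
weights = {"depyler": 0.40, "hf": 0.30, "jax": 0.15, "vllm": 0.15}
source_rows = {"depyler": 1_843, "hf": 11_493, "jax": 2_637, "vllm": 1_100}

for name, w in weights.items():
    sampled = round(total_rows * w)       # rows drawn from this source
    factor = sampled / source_rows[name]  # >1 means rows repeat (upsampled)
    print(f"{name}: {sampled} rows, {factor:.1f}x")
# depyler line: "depyler: 6828 rows, 3.7x" — matches the ~6,829 above
```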
## Data Splits
| Split | Rows | Size | Seed | Weights |
|---|---|---|---|---|
| train | 17,070 | 201MB | 42 | depyler:0.4 hf:0.3 jax:0.15 vllm:0.15 |
| val | 500 | 7MB | 123 | equal 0.25 each |
| test | 200 | 2.4MB | 456 | equal 0.25 each |
## FIM Augmentation

Fill-in-the-Middle transforms applied via `alimentar fim`:

```sh
alimentar fim mixed.parquet \
  --output mixed-fim.parquet \
  --column text \
  --rate 0.5 \
  --format psm \
  --seed 42
```
- Format: PSM (Prefix-Suffix-Middle)
- Rate: 50% of rows receive the FIM transform
- Sentinel tokens: `<|fim_prefix|>`, `<|fim_suffix|>`, `<|fim_middle|>`
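In PSM format, a document is cut at two random points and reordered so the model learns to predict the middle from the surrounding context. A minimal sketch of the transform — the sentinel strings follow the list above, but the split logic is an illustrative assumption, not alimentar's exact implementation:

```python
import random

def fim_psm(text: str, rng: random.Random) -> str:
    # Pick two cut points; everything between them becomes the "middle"
    a, b = sorted(rng.randrange(len(text) + 1) for _ in range(2))
    prefix, middle, suffix = text[:a], text[a:b], text[b:]
    # PSM: Prefix-Suffix-Middle ordering with sentinel tokens
    return (f"<|fim_prefix|>{prefix}"
            f"<|fim_suffix|>{suffix}"
            f"<|fim_middle|>{middle}")

rng = random.Random(42)
print(fim_psm("def add(a, b):\n    return a + b\n", rng))
```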
## BPE Tokenizer

Trained via `apr tokenize apply`:

```sh
apr tokenize apply \
  --data corpus-raw.txt \
  --vocab-size 32768 \
  --algorithm bpe \
  --max-lines 100000 \
  -o tokenizer/
```
Results:

- Final vocab size: 32,768
- Merges: 32,518
- Training time: 2022.5s (~33.7 min)
- Training data: 100K lines of Python code
- Special tokens: `<unk>`, `<s>`, `</s>`, `<pad>`
- Python pattern coverage: 8/8 (`def`, `return`, `self`, `import`, `class`, `for`, `if`, `in`)
- Output: `tokenizer/vocab.json` + `tokenizer/merges.txt`
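The merge list produced here is learned greedily: count adjacent symbol pairs across the corpus, merge the most frequent pair into a new token, and repeat until the vocabulary budget is exhausted. A toy sketch of one merge step (illustrative of the algorithm, not apr's implementation):

```python
from collections import Counter

def get_pair_counts(words):
    # words: dict mapping symbol-tuple -> corpus frequency
    counts = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(words, pair):
    # Replace every adjacent occurrence of `pair` with the fused symbol
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: "def" appears 3x, "ref" 1x
words = {("d", "e", "f"): 3, ("r", "e", "f"): 1}
counts = get_pair_counts(words)
pair = max(counts, key=counts.get)  # ("e", "f"), seen 4 times
words = merge_pair(words, pair)     # {("d","ef"): 3, ("r","ef"): 1}
```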
## HuggingFace tokenizer.json Conversion

Entrenar requires HuggingFace tokenizer.json format, but `apr tokenize apply`
produces raw vocab.json + merges.txt. A Python conversion step bridges the gap
(ALB-033):

```python
import json
from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Load the raw apr output
with open('tokenizer/vocab.json') as f:
    vocab = json.load(f)
with open('tokenizer/merges.txt') as f:
    merges = [tuple(line.split()) for line in f
              if line.strip() and not line.startswith('#')]

bpe = models.BPE(vocab=vocab, merges=merges, end_of_word_suffix='</w>')
tokenizer = Tokenizer(bpe)
tokenizer.pre_tokenizer = pre_tokenizers.Split(pattern=' ', behavior='removed')
tokenizer.decoder = decoders.BPEDecoder(suffix='</w>')
tokenizer.save('models/albor-tokenizer/tokenizer.json')
```
Key details:

- Merges must be string format (`"i n"`), not array format (`["i", "n"]`)
- Pre-tokenizer matches aprender's `split_whitespace()` behavior
- `</w>` end-of-word suffix matches aprender's BPE encoding
- Regular vocab: 32,768 tokens (IDs 0-32767)
- FIM special tokens: 3 additional (IDs 32768-32770)
## Parquet Schema

All data files use a consistent schema:

```text
{
  text:   Utf8, -- Python source code
  source: Utf8, -- Corpus name (depyler, hf, jax, vllm)
  file:   Utf8  -- Original file path
}
```
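Rows destined for these files can be sanity-checked against the schema before writing. A stdlib-only sketch — the `validate_row` helper is illustrative, not part of the pipeline:

```python
SCHEMA = ("text", "source", "file")  # all Utf8 columns

def validate_row(row: dict) -> None:
    # Exactly the three schema columns, all strings (Utf8)
    assert set(row) == set(SCHEMA), f"unexpected columns: {set(row)}"
    for col in SCHEMA:
        assert isinstance(row[col], str), f"{col} must be Utf8 (str)"

row = {
    "text": "def main():\n    pass\n",  # Python source code
    "source": "depyler",                # corpus name
    "file": "examples/main.py",         # original file path
}
validate_row(row)  # passes silently
```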
## Provenance

SHA-256 hashes for all data artifacts are recorded in docs/PROVENANCE.md.
Each split is generated with its own recorded random seed, so every split can be
reproduced independently.
## ByteLevel BPE Tokenizer (v2)

The v1 tokenizer (from `apr tokenize apply`) normalizes whitespace, which loses
Python indentation. The v2 tokenizer uses ByteLevel BPE (like GPT-2/CodeLlama):

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(vocab_size=32768, special_tokens=[...])
tokenizer.train(["corpus-raw.txt"], trainer)
tokenizer.save("models/albor-tokenizer-v2/tokenizer.json")
```
- Vocab: 32,768 (same size, different encoding)
- Roundtrip: 6/6 PASS (preserves newlines, indentation, blank lines)
- Merges: 32,557
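ByteLevel BPE sidesteps whitespace loss by mapping every byte to a printable character before any merging, so newlines and indentation survive as ordinary vocabulary symbols. A sketch of the GPT-2-style byte-to-unicode mapping that the `tokenizers` ByteLevel components are based on:

```python
def bytes_to_unicode():
    # Printable ASCII and two Latin-1 ranges map to themselves; the rest
    # (including \n, \t, and space) are remapped to code points >= 256 so
    # every byte becomes a visible, merge-friendly character.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

enc = bytes_to_unicode()
dec = {v: k for k, v in enc.items()}

code = "def f():\n    return 1\n\n"
visible = "".join(enc[b] for b in code.encode("utf-8"))
roundtrip = bytes(dec[ch] for ch in visible).decode("utf-8")
assert roundtrip == code  # indentation, newlines, blank line preserved
```

This is why the v2 roundtrip checks pass: no byte is ever normalized away, only re-spelled.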
## Pre-Tokenized Data

Training data is pre-tokenized and chunked into fixed-length sequences for efficient training:
| Dataset | Sequences | Seq Length | Total Tokens | Format |
|---|---|---|---|---|
| pretokenized-2048/train (v1) | 22,079 | 2048 | 45.2M | Parquet (input_ids: List<u32>) |
| pretokenized-2048/val | 814 | 2048 | 1.7M | Parquet (input_ids: List<u32>) |
| pretokenized-2048-v2/train | 67,977 | 2048 | 139M | Parquet (input_ids: List<u32>) |
| pretokenized-2048-v2/val | 814 | 2048 | 1.7M | Parquet (reused from v1) |
Pre-tokenization avoids the entrenar↔aprender BPE compatibility issue (ALB-033)
and enables direct `input_ids` column loading.
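The chunking step concatenates each document's token IDs and slices the stream into fixed-length sequences. A minimal sketch — illustrative only; pretokenize.py's exact handling of remainders and document boundaries may differ:

```python
def chunk(ids, seq_len=2048):
    # Slice a flat token stream into full-length sequences;
    # the trailing remainder (< seq_len tokens) is dropped here.
    return [ids[i:i + seq_len] for i in range(0, len(ids) - seq_len + 1, seq_len)]

stream = list(range(10))
print(chunk(stream, seq_len=4))  # [[0, 1, 2, 3], [4, 5, 6, 7]] — 8, 9 dropped
```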
## v2 Data Expansion (2026-03-03)
The v2 dataset expands from Tier 1 only to Tier 1 (10x upsampled) + 8 Tier 2 repos:
| Source | Type | Files | Weight |
|---|---|---|---|
| depyler | Tier 1 | 1,843 | 10x |
| hf-ground-truth | Tier 1 | 11,493 | 10x |
| jax-ground-truth | Tier 1 | 2,637 | 10x |
| vllm-ground-truth | Tier 1 | 1,100 | 10x |
| pytorch | Tier 2 | 3,801 | 1x |
| hf-repos | Tier 2 | 19,781 | 1x |
| mlflow | Tier 2 | 1,780 | 1x |
| vllm-full | Tier 2 | 2,239 | 1x |
| tgi | Tier 2 | 372 | 1x |
| algo-corpus | Tier 2 | 186 | 1x |
| cuda-python | Tier 2 | 157 | 1x |
| llms-with-hf | Tier 2 | 37 | 1x |
Pipeline: `source-to-parquet.py` → `alimentar mix` → `alimentar fim` (50% PSM) → `pretokenize.py`
Key finding: `alimentar import local` expects data files (CSV/JSON/Parquet),
not source code directories. The workaround script `scripts/source-to-parquet.py`
converts Python repos to Parquet with the Tier 1 schema (file, source, text columns).
Result: 45,420 mixed rows → 67,977 pretokenized sequences × 2048 = 139M tokens (191 MiB).
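The conversion can be sketched as a walk over a repo that collects .py files into schema rows; the actual `scripts/source-to-parquet.py` may differ, and the Parquet write itself is omitted here:

```python
import pathlib

def collect_rows(repo_root: str, source_name: str):
    # Build (text, source, file) rows matching the Tier 1 Parquet schema
    root = pathlib.Path(repo_root)
    rows = []
    for path in sorted(root.rglob("*.py")):
        rows.append({
            "text": path.read_text(encoding="utf-8", errors="replace"),
            "source": source_name,
            "file": str(path.relative_to(root)),
        })
    return rows
```

The `errors="replace"` choice is an assumption: Tier 2 repos occasionally contain non-UTF-8 files, and replacing bad bytes is one way to keep the walk from aborting.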
## Tools Used

- `alimentar import local` — JSONL to Parquet conversion
- `alimentar mix` — weighted sampling with upsampling
- `alimentar fim` — Fill-in-the-Middle augmentation
- `apr tokenize plan/apply` — BPE vocabulary training (v1, whitespace-split)
- Python `tokenizers` — ByteLevel BPE training (v2, whitespace-preserving)
- `scripts/source-to-parquet.py` — Python source code to Parquet (for Tier 2 repos)
- `entrenar` (parquet feature) — Parquet-to-LMBatch bridge for training