Case Study: Tokenizer Surgery
Ticket: GH-447
Module: aprender::online::tokenizer_surgery
Overview
Transplants embedding rows when adapting a pre-trained model to a new tokenizer. Supports direct copy, nearest-neighbor, and average-pool strategies for unmatched tokens.
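As a rough illustration of the idea, the sketch below builds a target embedding table from a source table. It is not the module's implementation: the function name `transplant_sketch`, its signature, and the choice to fill unmatched tokens with the mean of all source rows (one possible reading of "average pool") are assumptions made for this example.

```rust
use std::collections::HashMap;

/// Illustrative only (hypothetical name and signature).
/// Tokens present in both vocabularies get a direct copy of the source row;
/// unmatched target tokens receive the mean of all source rows.
pub fn transplant_sketch(
    source_vocab: &[&str],
    target_vocab: &[&str],
    source_emb: &[Vec<f32>],
) -> Vec<Vec<f32>> {
    assert_eq!(source_vocab.len(), source_emb.len(), "one row per source token");
    let dim = source_emb.first().map_or(0, |row| row.len());

    // Token string -> source row index.
    let index: HashMap<&str, usize> = source_vocab
        .iter()
        .enumerate()
        .map(|(i, tok)| (*tok, i))
        .collect();

    // Mean of all source rows, used as the fallback for unmatched tokens.
    let mut mean = vec![0.0f32; dim];
    for row in source_emb {
        for (m, v) in mean.iter_mut().zip(row) {
            *m += *v;
        }
    }
    let n = source_emb.len().max(1) as f32;
    for m in &mut mean {
        *m /= n;
    }

    // One output row per target token, in target order.
    target_vocab
        .iter()
        .map(|tok| match index.get(tok) {
            Some(&i) => source_emb[i].clone(), // matched: direct copy
            None => mean.clone(),              // unmatched: averaged fallback
        })
        .collect()
}
```

The output always has one row per target token and the same row width as the source table, which is the shape property the falsification tests check.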
Key Components
- TokenizerSurgeryConfig: Vocab sizes, overlap threshold, surgery method
- SurgeryMethod: DirectCopy, NearestNeighbor, AveragePool
- VocabMapping: Bidirectional source/target index mapping
- compute_vocab_overlap: O(n+m) vocabulary intersection
- transplant_embeddings: Row-by-row embedding transfer
- validate_surgery: Quality gate on overlap ratio
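The overlap computation and quality gate can be sketched as follows. The types and functions here (`MappingSketch`, `overlap_sketch`, `passes_gate_sketch`) are hypothetical stand-ins, not the module's actual API. The sketch also assumes that tokens are unique within each vocabulary and that the overlap ratio is measured against the target vocabulary size.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a bidirectional source/target index mapping.
#[derive(Debug, Default)]
pub struct MappingSketch {
    pub source_to_target: HashMap<usize, usize>,
    pub target_to_source: HashMap<usize, usize>,
}

/// Intersects two vocabularies in O(n + m) expected time:
/// one pass to index the source, one pass over the target.
/// Returns the mapping and the overlap ratio in [0, 1].
pub fn overlap_sketch(source: &[&str], target: &[&str]) -> (MappingSketch, f64) {
    // Token string -> source index (O(n)).
    let index: HashMap<&str, usize> = source
        .iter()
        .enumerate()
        .map(|(i, tok)| (*tok, i))
        .collect();

    // Look up each target token once (O(m)) and record both directions.
    let mut mapping = MappingSketch::default();
    for (t_idx, tok) in target.iter().enumerate() {
        if let Some(&s_idx) = index.get(tok) {
            mapping.source_to_target.insert(s_idx, t_idx);
            mapping.target_to_source.insert(t_idx, s_idx);
        }
    }

    // Assumption: ratio = matched target tokens / target vocabulary size.
    let ratio = if target.is_empty() {
        0.0
    } else {
        mapping.target_to_source.len() as f64 / target.len() as f64
    };
    (mapping, ratio)
}

/// Quality gate: accept the surgery only if the overlap ratio meets the threshold.
pub fn passes_gate_sketch(ratio: f64, threshold: f64) -> bool {
    ratio >= threshold
}
```

Because the matched count can never exceed the target vocabulary size, the ratio stays within [0, 1], and two identical vocabularies produce an identity mapping with a ratio of 1.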
Run
cargo run --example tokenizer_surgery
Falsification Tests
| ID | Property | Status |
|---|---|---|
| FALSIFY-SURGERY-001 | Overlap ratio in [0, 1] | Not falsified (holds) |
| FALSIFY-SURGERY-002 | Transplant preserves dimensions | Not falsified (holds) |
| FALSIFY-SURGERY-003 | Identical vocabs yield identity | Not falsified (holds) |