Lab: Tokenizer

Build a byte-pair encoding (BPE) tokenizer to see how LLMs turn raw text into token IDs.

Objectives

  • Implement byte-pair encoding
  • Handle special tokens
  • Encode and decode text
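
As a rough illustration of the first objective, the sketch below learns merge rules by repeatedly merging the most frequent adjacent pair. The names `learn_merges`, `count_pairs`, and `merge_pair` are hypothetical and not part of the lab code. To stay short it starts from characters rather than raw bytes, and it stops once no pair occurs more than once, which is only one of several possible stopping rules.

```rust
use std::collections::HashMap;

/// Count how often each adjacent token pair occurs (overlapping pairs included).
fn count_pairs(tokens: &[String]) -> HashMap<(String, String), usize> {
    let mut counts = HashMap::new();
    for pair in tokens.windows(2) {
        *counts.entry((pair[0].clone(), pair[1].clone())).or_insert(0) += 1;
    }
    counts
}

/// Replace every adjacent (a, b) pair, scanning left to right, with the token a+b.
fn merge_pair(tokens: &[String], a: &str, b: &str) -> Vec<String> {
    let mut out = Vec::with_capacity(tokens.len());
    let mut i = 0;
    while i < tokens.len() {
        if i + 1 < tokens.len() && tokens[i] == a && tokens[i + 1] == b {
            out.push(format!("{a}{b}"));
            i += 2;
        } else {
            out.push(tokens[i].clone());
            i += 1;
        }
    }
    out
}

/// Learn up to `num_merges` merge rules from `text` (hypothetical helper).
pub fn learn_merges(text: &str, num_merges: usize) -> Vec<(String, String)> {
    // Assumption: start from one token per character, not per byte.
    let mut tokens: Vec<String> = text.chars().map(|c| c.to_string()).collect();
    let mut merges = Vec::new();
    for _ in 0..num_merges {
        // Most frequent pair that repeats; ties go to the lexicographically
        // smaller pair so the result does not depend on HashMap order.
        let best = count_pairs(&tokens)
            .into_iter()
            .filter(|(_, n)| *n > 1)
            .max_by(|(p1, n1), (p2, n2)| n1.cmp(n2).then_with(|| p2.cmp(p1)));
        let Some(((a, b), _)) = best else { break };
        tokens = merge_pair(&tokens, &a, &b);
        merges.push((a, b));
    }
    merges
}
```

Under these assumptions, `learn_merges("aaabdaaabac", 10)` yields the rules (a, a), (a, b), (aa, ab) and then stops, because no remaining pair repeats. The tie-breaking rule is an arbitrary choice made here for determinism.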

Demo Code

See demos/course4/week1/llm-serving/

Lab Exercise

See labs/course4/week1/lab_1_7_tokenizer.py

Key Implementation

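use std::collections::HashMap;

pub struct BpeTokenizer {
    /// Token string -> token id.
    vocab: HashMap<String, u32>,
    /// Merge rules, in the order they are applied.
    merges: Vec<(String, String)>,
    /// Reserved tokens and their ids (not yet used by `encode`).
    special_tokens: HashMap<String, u32>,
}

impl BpeTokenizer {
    pub fn encode(&self, text: &str) -> Vec<u32> {
        // Start from one token per character
        // (a byte-level BPE would start from `text.bytes()` instead).
        let mut tokens: Vec<String> = text.chars()
            .map(|c| c.to_string())
            .collect();

        // Apply merge rules in order
        for (a, b) in &self.merges {
            tokens = self.apply_merge(&tokens, a, b);
        }

        // Map tokens to ids; tokens missing from `vocab` are silently dropped
        tokens.iter()
            .filter_map(|t| self.vocab.get(t).copied())
            .collect()
    }

    // Minimal sketch of the merge helper; the lab version may differ.
    // Replaces every adjacent (a, b) pair with the single token a+b.
    fn apply_merge(&self, tokens: &[String], a: &str, b: &str) -> Vec<String> {
        let mut out = Vec::with_capacity(tokens.len());
        let mut i = 0;
        while i < tokens.len() {
            if i + 1 < tokens.len() && tokens[i] == a && tokens[i + 1] == b {
                out.push(format!("{a}{b}"));
                i += 2;
            } else {
                out.push(tokens[i].clone());
                i += 1;
            }
        }
        out
    }
}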
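
The struct above stores `special_tokens`, but `encode` does not use them and no `decode` is shown. A minimal sketch of both pieces follows, assuming special tokens are matched as literal substrings and that decoding simply concatenates token strings. `Segment`, `split_on_special`, and the free function `decode` are hypothetical names, not part of the lab code.

```rust
use std::cmp::Reverse;
use std::collections::HashMap;

/// A piece of input: ordinary text, or a special token to keep intact.
#[derive(Debug, PartialEq)]
pub enum Segment<'a> {
    Text(&'a str),
    Special(&'a str),
}

/// Split `text` so that special tokens are never broken up by the merge loop.
pub fn split_on_special<'a>(text: &'a str, specials: &[&str]) -> Vec<Segment<'a>> {
    let mut segments = Vec::new();
    let mut rest = text;
    while !rest.is_empty() {
        // Earliest special token in the remaining text; on a tie at the
        // same position, prefer the longer token.
        let next = specials
            .iter()
            .filter(|s| !s.is_empty())
            .filter_map(|s| rest.find(*s).map(|pos| (pos, s.len())))
            .min_by_key(|&(pos, len)| (pos, Reverse(len)));
        match next {
            Some((pos, len)) => {
                if pos > 0 {
                    segments.push(Segment::Text(&rest[..pos]));
                }
                segments.push(Segment::Special(&rest[pos..pos + len]));
                rest = &rest[pos + len..];
            }
            None => {
                segments.push(Segment::Text(rest));
                break;
            }
        }
    }
    segments
}

/// Map ids back to text by concatenating token strings.
/// Unknown ids are skipped, mirroring how `encode` drops unknown tokens.
pub fn decode(
    ids: &[u32],
    vocab: &HashMap<String, u32>,
    special_tokens: &HashMap<String, u32>,
) -> String {
    // Invert both maps: id -> token string.
    let id_to_token: HashMap<u32, &str> = vocab
        .iter()
        .chain(special_tokens.iter())
        .map(|(tok, &id)| (id, tok.as_str()))
        .collect();
    ids.iter()
        .filter_map(|id| id_to_token.get(id).copied())
        .collect()
}
```

In this sketch only the `Segment::Text` pieces would go through the merge loop, while each `Segment::Special` piece maps directly to its reserved id.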