Gemini's SentencePiece tokenizer

Tokenization

Character-level: splits text into individual characters; a single character carries little meaning on its own, so it is hard for the model to learn from -> can be used in a hybrid setup for messy/typo-ridden text
Word-level: splits at whitespace; huge vocabulary, since run, runs, running are all different tokens (good for meaning only)
Subword: common words stay whole, complex words decompose, bounded vocabulary -> the gold standard; Byte-Pair Encoding (BPE) and WordPiece belong here
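
To make the three granularities concrete, here is a minimal sketch. The `TOY_VOCAB` set and the greedy longest-match splitter are assumptions for illustration only (longest-match is roughly how WordPiece applies a vocabulary, not how any particular production tokenizer is implemented):

```python
def char_tokens(text):
    # Character-level: every character (spaces included) is its own token.
    return list(text)

def word_tokens(text):
    # Word-level: split at whitespace; each distinct word form is a separate token.
    return text.split()

def subword_tokens(word, vocab):
    # Subword: greedy longest-match against a vocabulary,
    # falling back to a single character when nothing longer matches.
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

# Hypothetical toy vocabulary, not taken from any real tokenizer.
TOY_VOCAB = {"run", "s", "ning"}

for w in ["run", "runs", "running"]:
    print(w, "->", subword_tokens(w, TOY_VOCAB))
```

With this toy vocabulary, all three word forms share the piece `run`, which is the sharing that word-level tokenization lacks.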

BPE training: start with individual characters as the tokens (underscore marks a space)
- then merge the most frequent adjacent pair & store it as a new token
- repeat: recount & merge until the target vocabulary size is reached -> greedy, frequency-based compression: frequent character sequences in the training corpus become single tokens, rare ones stay split
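
The merge loop above can be sketched in a few lines. This is a toy version under stated assumptions: the function name `train_bpe`, the word-list input, and the first-seen tie-breaking are all illustrative choices, and real tokenizers add byte fallback, normalization, and much faster pair counting:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn up to num_merges merge rules from a list of words (toy sketch)."""
    # Each word becomes a tuple of symbols; "_" marks the leading space.
    words = Counter(tuple("_" + w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Greedy step: take the most frequent pair (ties go to the first seen).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges, words

# Tiny made-up corpus: "the" is frequent, "an" is rare.
merges, words = train_bpe(["the"] * 3 + ["then", "an"], num_merges=3)
print(merges)
print(dict(words))
```

After three merges the frequent word ends up as the single token `_the`, while the rare word is still split into `_`, `a`, `n`.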
LLM training corpus - largest public-domain dataset for LLMs: approx. 2 trillion tokens, multilingual; BERT: BooksCorpus (Toronto Book Corpus) & English Wikipedia (approx. 3.3B words); BLOOM: multilingual, 366B tokens -> the tokenizer is trained on the corpus before the model is trained, then frozen
English text dominates the corpus, so frequent English sequences like 'the' and 'ing' get merged into single tokens early. Other languages appear less often, so their words get fewer merges and split into more tokens for the same text. Can be mitigated with a multilingual-tuned model, prompt compression, or routing non-English requests to a model with a more balanced tokenizer
Log the token count per request
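
A minimal sketch of per-request token logging, using only the standard `logging` module. The function name `log_token_count`, the `request_id` field, and the `tokenize` callable are hypothetical; in practice `tokenize` would wrap whatever tokenizer the deployed model actually uses. The tokens-per-word ratio is included as a rough signal for requests that tokenize inefficiently:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("token_usage")

def log_token_count(request_id, text, tokenize):
    """Log and return the number of tokens in one request.

    tokenize: any callable mapping a string to a list of tokens
    (a stand-in here, not a real model tokenizer).
    """
    n_tokens = len(tokenize(text))
    # Whitespace word count, floored at 1 to avoid division by zero.
    n_words = max(len(text.split()), 1)
    logger.info(
        "request=%s tokens=%d tokens_per_word=%.2f",
        request_id, n_tokens, n_tokens / n_words,
    )
    return n_tokens

# Stand-in tokenizers for illustration only: whitespace split vs. per-character.
log_token_count("req-1", "the cat is running", str.split)
log_token_count("req-2", "the cat is running", list)
```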