Gemini's SentencePiece tokenizer
Tokenization
Text -> tokens (~4 characters per token, a rough average for English text)
- Character-level: splits text into individual characters; a single character carries little meaning, so the model has a harder time learning from it -> can be used in a hybrid setup for messy/typo-ridden text
- Word-level: splits at whitespace; huge vocabulary; run, runs, running -> all different tokens with nothing shared (only good for whole-word meaning)
- Subword: common words stay whole, complex words decompose, vocabulary stays bounded -> gold standard
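A toy sketch of the three granularities on the run/runs/running example. The subword vocabulary `SUBWORD_VOCAB` is hand-picked for illustration (a real one is learned from a corpus), and the greedy longest-match splitter is just one simple way to apply a fixed vocabulary, not how any particular tokenizer is implemented.

```python
text = "running runs run"

# Character-level: one token per character ("_" stands in for the space).
char_tokens = list(text.replace(" ", "_"))

# Word-level: split at whitespace; every surface form is its own token.
word_tokens = text.split()

# Hypothetical subword vocabulary, chosen by hand for this example only.
SUBWORD_VOCAB = {"_run", "ning", "s", "_", "r", "u", "n", "i", "g"}


def subword_tokenize(text, vocab):
    """Greedy longest-match split; '_' marks the space before each word."""
    s = "_" + text.replace(" ", "_")
    tokens, i = [], 0
    while i < len(s):
        # Try the longest candidate first, then shrink.
        for j in range(len(s), i, -1):
            if s[i:j] in vocab:
                tokens.append(s[i:j])
                i = j
                break
        else:
            # Character not covered by the vocabulary: emit it as-is.
            tokens.append(s[i])
            i += 1
    return tokens


subword_tokens = subword_tokenize(text, SUBWORD_VOCAB)
print(len(char_tokens), char_tokens)
print(len(word_tokens), word_tokens)
print(len(subword_tokens), subword_tokens)
```

With this vocabulary all three words share the `_run` piece, which is the point of the subword row: related forms reuse tokens while the vocabulary stays small.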
Byte-pair encoding (BPE) and WordPiece are subword methods. BPE training:
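- Start with characters (a marker such as an underscore stands in for the space) -> these are the initial tokens
- Then merge the most frequent adjacent pair & store it as a new token
- Repeat: recount pairs & merge again until the vocabulary reaches its target size; the learned merges are then used to convert text to tokens
-> Greedy, frequency-based compression: frequent character sequences in the training corpus become tokens, rare ones stay split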
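The merge loop above can be sketched in a few lines. This is a minimal word-level BPE trainer for illustration, assuming whitespace pre-splitting and `_` as the space marker; `train_bpe` and its parameters are names made up here, and production tokenizers (SentencePiece included) differ in many details.

```python
from collections import Counter


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-split corpus."""
    # Each word starts as a tuple of single characters, "_" marking the space.
    words = Counter(tuple("_" + w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges, words


merges, words = train_bpe("low low low lower lowest", num_merges=3)
print(merges)
print(words)
```

After three merges the frequent word `low` has collapsed into a single symbol, while the rarer endings in `lower` and `lowest` are still split into characters.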
- LLM training corpus: largest public-domain dataset for LLMs is approx. 2 trillion tokens, multilingual. For comparison: BERT - Toronto BookCorpus & English Wikipedia (~3.3B words); BLOOM - multilingual, ~366B tokens
-> the tokenizer is trained on the corpus first, then frozen before the model is trained
- English dominates the corpus, so frequent English sequences like 'the' and 'ing' become single tokens early; other languages appear less often, so their words earn fewer merges and get split into more tokens per word.
Can be mitigated with a multilingual-tuned model, prompt compression, or routing non-English requests to a model with a more balanced tokenizer
- Log the token count per request
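A minimal sketch of per-request token logging with the standard `logging` module. `estimate_tokens` is a stand-in that applies the ~4 characters/token rule of thumb from the top of these notes; `log_request` and the `count_tokens` parameter are hypothetical names, and in practice the real tokenizer's count would be passed in instead.

```python
import logging
import math

logger = logging.getLogger("token_usage")


def estimate_tokens(text, chars_per_token=4.0):
    """Rough estimate only: ~4 characters per token, rounded up."""
    return math.ceil(len(text) / chars_per_token)


def log_request(request_id, prompt, count_tokens=estimate_tokens):
    """Log character and token counts for one request; return the token count."""
    n_tokens = count_tokens(prompt)
    logger.info("request=%s chars=%d tokens=%d", request_id, len(prompt), n_tokens)
    return n_tokens


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log_request("req-001", "Summarize the following paragraph in one sentence.")
```

Logging characters next to tokens makes it easy to spot requests where the ratio drifts far from ~4, which is one sign of non-English or otherwise poorly compressed input.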