Gemini's SentencePiece tokenizer

Tokenization

  1. Character-level: splits text into individual characters; a single character carries little meaning, so semantics are hard to learn -> can be used in a hybrid scheme for messy/typo-ridden text
  2. Word-level: splits at whitespace, huge vocabulary; run, runs, running -> each is a different token (only good for whole-word meaning)
  3. Subword: common words stay whole, complex words decompose, bounded vocabulary -> the gold standard; byte-pair encoding (BPE) and WordPiece belong here
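
A rough sketch of the three levels side by side, to make the trade-offs concrete. This is a toy illustration, not how SentencePiece or Gemini's tokenizer is implemented: the function names (char_tokenize, word_tokenize, learn_bpe_merges, apply_merge, bpe_tokenize) and the tiny corpus are made up for these notes, and the BPE part is a simplified version of the usual "merge the most frequent adjacent pair" idea, with no end-of-word marker and no byte fallback.

```python
from collections import Counter


def char_tokenize(text):
    # Character-level: every character is its own token.
    return list(text)


def word_tokenize(text):
    # Word-level: split at whitespace; "run" and "runs" are unrelated tokens.
    return text.split()


def apply_merge(symbols, pair):
    # Replace each adjacent occurrence of `pair` with one merged symbol.
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe_merges(words, num_merges):
    # Toy BPE training: start from characters and repeatedly merge the
    # most frequent adjacent pair of symbols across the corpus.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(apply_merge(symbols, best))] += freq
        vocab = new_vocab
    return merges


def bpe_tokenize(word, merges):
    # Subword: apply the learned merges, in order, to a single word.
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


corpus = ["run", "runs", "running", "run", "runs"]  # hypothetical toy corpus
merges = learn_bpe_merges(corpus, num_merges=2)

print(char_tokenize("running"))
print(word_tokenize("run runs running"))
for word in ["run", "runs", "running"]:
    print(word, "->", bpe_tokenize(word, merges))
```

With two merges on this corpus the frequent stem "run" ends up as a single token, while "runs" and "running" are split into "run" plus leftover pieces. That is the subword behaviour described in item 3: the shared stem is reused instead of getting three unrelated vocabulary entries.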