BPE Tokenization — Step-by-Step Visualization

medium · AI · ML · NLP · Tokenization · Generative AI

Step through Byte Pair Encoding — watch the most frequent character pairs iteratively merge into subword tokens, building a vocabulary from scratch.

Algorithm Pattern

Iterative Pair Merging

Key Idea

BPE starts with character-level tokens and repeatedly merges the most frequent adjacent pair. This builds a subword vocabulary: a rare or unseen word decomposes into known subwords (in the worst case, single characters), which largely avoids the unknown-word problem.

Step-by-Step Approach

  1. Initialize vocabulary with all unique characters in the corpus.
  2. Count the frequency of every adjacent token pair (character pairs on the first pass).
  3. Merge the most frequent pair into a new token and add it to the vocabulary.
  4. Replace every occurrence of that pair in the corpus, then repeat from step 2.
  5. Stop when vocabulary size reaches the target.
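
The five steps above can be sketched in a few lines of Python. This is a minimal toy version, not a production tokenizer: the names `train_bpe` and `merge_pair` are made up for illustration, the corpus is assumed to be a plain list of words, and ties between equally frequent pairs are broken by whichever pair was seen first (real implementations may choose differently).

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, target_vocab_size):
    """Learn BPE merges from a list of words (hypothetical helper, toy sketch)."""
    # Each distinct word is stored once as a tuple of symbols, with its count.
    corpus = Counter(tuple(w) for w in words)
    # Step 1: the initial vocabulary is every unique character.
    vocab = {ch for word in corpus for ch in word}
    merges = []
    # Step 5: stop once the vocabulary reaches the target size.
    while len(vocab) < target_vocab_size:
        # Step 2: count adjacent pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single token
        # Step 3: pick the most frequent pair (first seen wins a tie).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab.add(best[0] + best[1])
        # Step 4: rewrite the corpus with the merged token.
        new_corpus = Counter()
        for word, freq in corpus.items():
            new_corpus[merge_pair(word, best)] += freq
        corpus = new_corpus
    return merges, vocab


words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, vocab = train_bpe(words, target_vocab_size=12)
print(merges)
```

On this toy corpus there are 10 distinct characters, so a target of 12 allows two merges: first `e` + `s`, then `es` + `t`, which yields the subword `est` shared by "newest" and "widest".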

Common Gotchas

  • GPT-2 uses a vocabulary of 50,257 tokens and GPT-4 uses ~100,000; both are built with byte-level BPE.
  • BPE is greedy: merges are applied in the order they were learned, so a different merge order can tokenize the same text differently.
  • Whitespace is included in tokens (e.g. ' the') to preserve word boundaries.
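
The merge-order and whitespace points can be illustrated with a small encoder sketch. The function `apply_merges` and the merge lists below are hypothetical, chosen only for the demonstration; the sketch also treats the whole string as one sequence of characters, whereas real tokenizers typically pre-split the text before applying merges.

```python
def apply_merges(text, merges):
    """Tokenize `text` by applying each merge rule in order (toy sketch)."""
    symbols = list(text)
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Merge the pair wherever it appears, scanning left to right.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Same two rules, different order: the first rule to fire consumes the shared "b".
order_a = [("a", "b"), ("b", "c")]
order_b = [("b", "c"), ("a", "b")]
print(apply_merges("abc", order_a))  # ['ab', 'c']
print(apply_merges("abc", order_b))  # ['a', 'bc']

# Merges that start with a space keep the word boundary inside the token.
space_merges = [(" ", "t"), (" t", "h"), (" th", "e")]
print(apply_merges(" the cat", space_merges))  # [' the', ' ', 'c', 'a', 't']
```

Because the leading space is part of `' the'`, the original text can be recovered by simply concatenating the tokens.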

Related Problems