Subword Tokenization

Subword tokenization splits text into pieces that usually fall between single characters and whole words in size. It addresses the open-vocabulary problem for neural language models and is the dominant approach in modern LLMs, with BPE, WordPiece, and Unigram as the main algorithm families.

Subword tokenization is the standard way modern large language models convert text into the integer sequences their networks actually consume. Pure word-level tokenization fails on rare or novel words (every out-of-vocabulary word collapses into a single 'unknown' token), while pure character-level tokenization produces sequences too long for transformers to process efficiently. Subword methods strike a balance: common words become single tokens, while rare words are decomposed into a handful of learned pieces.

Three algorithm families dominate. Byte-Pair Encoding (BPE), used by GPT and LLaMA, builds a vocabulary by iteratively merging the most frequent adjacent symbol pairs. WordPiece, used by BERT, merges pairs that maximize training-data likelihood rather than raw frequency. The Unigram language model, popularized by SentencePiece and used by T5 and Gemini, starts from a large candidate vocabulary and prunes it probabilistically. All three typically produce vocabularies of 30,000 to 250,000 entries.

The trade-offs matter at the system level. Subword tokenization improves compression of common patterns (code, English prose, frequent multi-character sequences), which shortens token sequences, so more text fits in a fixed context window and inference costs less. But it also hides character-level structure: the model cannot directly see that 'strawberry' contains three R's because it sees a few opaque subword IDs. This drives a family of well-known failures explored in Tokenization Boundary Problems in LLMs. It also produces uneven cost across languages, because tokenizers trained mostly on English compress non-Latin scripts poorly.
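
As a concrete illustration of the BPE merge procedure described above, here is a minimal sketch in Python. It is a toy, not how any production tokenizer is implemented: the function names (learn_bpe_merges, merge_pair, encode), the tiny corpus, and the tie-breaking rule are all assumptions made for this example. Real tokenizers typically also apply pre-tokenization, operate on bytes, and use far more efficient data structures.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (hypothetical helper)."""
    # Each distinct word starts as a tuple of characters, weighted by its count.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties go to the lexicographically smaller pair
        # (an assumption made here so the result is deterministic).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        # Apply the new merge to every word before the next round.
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


def encode(word, merges):
    """Tokenize one word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus: word frequencies stand in for a real training set.
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe_merges(corpus, 4)
print(merges)                    # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(encode("lowest", merges))  # ['low', 'est']
```

Note that "lowest" never appears in the toy corpus, yet it is encoded as two learned pieces rather than an unknown token, which is the open-vocabulary behavior the text describes. It also shows why character-level questions are hard: after encoding, the individual letters inside 'est' are no longer separate symbols.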


This knowledge chunk is from Philosopher's Stone (https://philosophersstone.ee), an open knowledge commons with 93% confidence.