Tokenization Boundary Problems in LLMs

Large language models systematically fail at character-level tasks because their input is preprocessed into subword tokens, not letters. The notorious 'strawberry' counting failure, broken Caesar ciphers, and bungled string reversals all share the same root cause: the model literally never sees individual characters.

Modern LLMs do not read text as a stream of characters. Before training and inference, raw text passes through a tokenizer that splits it into subword tokens, drawn from a vocabulary of typically a few thousand to a few hundred thousand entries learned with Byte-Pair Encoding (BPE), WordPiece, or Unigram algorithms (often via the SentencePiece library). The model sees integer token IDs and their learned embeddings, never the underlying letters.

This creates a representational gap. The word 'strawberry' is encoded by GPT-4's cl100k_base tokenizer as roughly three tokens (variants like 'st' + 'raw' + 'berry'). To the network, 'strawberry' is no more decomposable into letters than the integer 42 is decomposable into prime factors at a glance: the information is recoverable only through learned statistical patterns. That is why the 'how many R's in strawberry?' question, trivial for any literate human, became a viral failure case for GPT-4, Claude, and Gemini alike in 2024.

The same tokenization boundary explains a whole family of failures: counting words or characters in long strings, detecting palindromes, reversing arbitrary text, applying Caesar ciphers by hand, and inserting a letter after every occurrence of another. These tasks require treating text as a sequence of discrete characters, but the model only has subword chunks. Theoretical analyses argue that transformer attention is limited in how reliably it can count or extract sub-token information, and token embeddings appear to carry weak signals about their constituent letters at best.

Fine-tuning rarely fixes the problem because the input pipeline is unchanged: post-training cannot teach a model to see something the tokenizer has already hidden. Switching to byte-level or character-level tokenization does expose the characters, but it inflates sequence length by roughly 4-5x for typical English text, with a corresponding cost in compute and context window.
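
To make the representational gap concrete, the following is a minimal sketch of what a subword tokenizer hands to the model. Everything here is an assumption for illustration: TOY_VOCAB is a hypothetical eleven-entry vocabulary, and toy_tokenize uses greedy longest-match segmentation rather than a real BPE merge table, so its output is not what cl100k_base or any production tokenizer would produce. The last lines compare the subword sequence with a byte-level encoding of the same word.

```python
# Toy illustration only: a hypothetical vocabulary and greedy longest-match
# segmentation. This is NOT cl100k_base and NOT a real BPE implementation.
TOY_VOCAB = {
    "st": 0, "raw": 1, "berry": 2,
    "s": 3, "t": 4, "r": 5, "a": 6, "w": 7, "b": 8, "e": 9, "y": 10,
}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text into the longest matching vocabulary pieces, left to right."""
    pieces = []
    max_len = max(len(piece) for piece in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in vocab:
                pieces.append(candidate)
                i += size
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return pieces


def encode(text, vocab=TOY_VOCAB):
    """Map text to the integer IDs that a model would actually receive."""
    return [vocab[piece] for piece in toy_tokenize(text, vocab)]


word = "strawberry"
pieces = toy_tokenize(word)
ids = encode(word)
byte_ids = list(word.encode("utf-8"))  # byte-level alternative: one ID per byte

print(pieces)                   # ['st', 'raw', 'berry']
print(ids)                      # [0, 1, 2]
print(len(ids), len(byte_ids))  # 3 10
```

In this toy setup the model would receive the three integers 0, 1, 2, and nothing in those IDs says that two of the pieces contain the letter 'r'. The byte-level encoding keeps every character visible, at the price of a sequence more than three times as long for this one word.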

Practical mitigations route character-level work around the tokenizer. Tool use, especially calling a Python interpreter, lets the model delegate counting and string manipulation to code that does see characters. Prompting the model to spell the word out one character at a time before counting also helps, because each letter then typically becomes its own token in the output. Chain-of-thought prompting and reasoning models such as OpenAI's o1 series (reportedly code-named 'Strawberry') reduce the failure rate by spending more inference compute on careful spelling, though the underlying constraint remains. See Large Language Models: How Next-Token Prediction Creates General Intelligence for the broader context of how token-level prediction shapes model behavior.
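
As a rough sketch of the tool-use mitigation, these are the kinds of small helpers a model could call through a Python interpreter instead of answering from token IDs. The function names (count_char, reverse_text, is_palindrome, caesar_shift, insert_after, spell_out) are hypothetical and chosen for illustration; they are not part of any particular tool-calling API, and caesar_shift assumes ASCII letters only.

```python
def count_char(text, ch):
    """Count case-insensitive occurrences of a single character."""
    return text.lower().count(ch.lower())


def reverse_text(text):
    """Reverse a string character by character."""
    return text[::-1]


def is_palindrome(text):
    """Check for a palindrome, ignoring case and non-alphanumeric characters."""
    letters = [c.lower() for c in text if c.isalnum()]
    return letters == letters[::-1]


def caesar_shift(text, shift):
    """Shift ASCII letters by `shift` positions; leave everything else as is."""
    out = []
    for c in text:
        if c.isascii() and c.isalpha():
            base = ord("a") if c.islower() else ord("A")
            out.append(chr((ord(c) - base + shift) % 26 + base))
        else:
            out.append(c)
    return "".join(out)


def insert_after(text, target, extra):
    """Insert `extra` after every occurrence of the character `target`."""
    return "".join(c + extra if c == target else c for c in text)


def spell_out(word):
    """Separate the characters, mirroring the spell-it-out prompting trick."""
    return "-".join(word)


print(count_char("strawberry", "r"))       # 3
print(reverse_text("strawberry"))          # yrrebwarts
print(is_palindrome("A man, a plan, a canal: Panama"))  # True
print(caesar_shift("abc xyz", 3))          # def abc
print(insert_after("banana", "a", "!"))    # ba!na!na!
print(spell_out("strawberry"))             # s-t-r-a-w-b-e-r-r-y
```

Each helper operates on Python string characters directly, so the result does not depend on how a tokenizer would have segmented the input.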


