The word "Google" might be encoded as a single token ["Google"] or as two tokens like ["Go", "ogle"], rather than as the individual characters ["G", "o", "o", "g", "l", "e"].
This creates two compounding problems:
First, the embedding layer doesn't fully encode character-level information. Research shows that LLM embedding layers store strong character information only for the first character of each token; beyond that, character-level detail degrades rapidly. When a model needs to count letters inside a token, it must reconstruct the character sequence from a representation that was never designed to preserve it. Later Transformer layers partially compensate (researchers have observed a distinct "breakthrough" point where the model manages to spell out a token correctly), but the process is unreliable and fragile.
Second, subword tokenizers are "largely oblivious to the internal structure of tokens." A 2024 study posted on arXiv coined the phrase "the curse of tokenization" to describe this vulnerability: tokenizers are inherently sensitive to typographical errors and length variations, and blind to the internal composition of the tokens themselves. A word like "journalism" might be a single token. The model never learned to decompose it into j-o-u-r-n-a-l-i-s-m at the character level, so when asked to spell it out, it guesses.
The result is what users saw with Google's AI Overview: an AI that can debate philosophy and write code, yet confidently insists there are two 'p's in "Google" and that "poop" contains exactly one 'r'.
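To make the gap between tokens and characters concrete, here is a minimal sketch of a toy tokenizer. It assumes a tiny hand-written vocabulary (`TOY_VOCAB` is hypothetical) and uses greedy longest-match segmentation, which is a simplification: production BPE tokenizers apply learned merge rules over vocabularies with tens of thousands of entries. The point is only that what reaches the model is a list of opaque IDs, not letters.

```python
# Hypothetical toy vocabulary, chosen only for illustration.
TOY_VOCAB = {"Google", "Go", "ogle", "journalism"}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match segmentation, falling back to single characters."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


def to_ids(tokens):
    """Replace each distinct token with an opaque integer ID."""
    table = {}
    return [table.setdefault(tok, len(table)) for tok in tokens]


if __name__ == "__main__":
    for vocab in (TOY_VOCAB, {"Go", "ogle"}, set()):
        tokens = toy_tokenize("Google", vocab=vocab)
        print(tokens, "->", to_ids(tokens))
```

With the full toy vocabulary, "Google" becomes one token and one ID, so nothing in the model's input says that the word contains two o's. Only the empty-vocabulary case, where every character is its own token, exposes the spelling directly.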
If the problem is tokenization, the intuitive fix is to use character-level or byte-level models and let the model see every letter. That approach exists (models like ByT5 operate directly on raw bytes), but it hasn't been widely adopted because it makes models dramatically more expensive to run.
Moving to pure character-level processing blows up sequence lengths by an estimated 3–5×. Compute costs rise at least proportionally, and faster for standard self-attention, whose cost grows quadratically with sequence length. Longer sequences also make it much harder for the model to learn long-range dependencies and semantic relationships. Subword tokenizers are the efficiency compromise that made modern LLMs practical: they compress text into shorter sequences drawn from a manageable vocabulary while preserving enough meaning for fluent language generation.
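The cost argument above can be put into rough numbers. The sketch below assumes standard full self-attention, where per-token work (such as feed-forward layers) grows linearly with sequence length and the number of pairwise attention scores grows quadratically. The function names are illustrative, and the 3× and 5× multipliers are simply the estimate quoted above, not measurements.

```python
def byte_length(text):
    """Number of positions a byte-level model would need for this text."""
    return len(text.encode("utf-8"))


def relative_costs(length_multiplier):
    """Relative cost when a sequence becomes `length_multiplier` times longer.

    Assumes full self-attention: linear per-token work, quadratic
    pairwise attention scores.
    """
    return {
        "per_token_work": length_multiplier,
        "attention_pairs": length_multiplier ** 2,
    }


if __name__ == "__main__":
    # A word that might be a single subword token needs 10 byte positions.
    print("journalism:", byte_length("journalism"), "bytes")
    for multiplier in (3, 5):
        print(multiplier, relative_costs(multiplier))
```

Under these assumptions, a 3–5× longer sequence means 3–5× the per-token work and 9–25× as many attention pairs, which is one way to see why byte-level models remain expensive.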
Researchers broadly agree that a "perfect" tokenizer likely does not exist. Tokenizers "routinely produce non-unique encodings" and create "representational mismatch" that is deeply architectural, not a simple bug to patch. The trade-off between character-level precision and semantic fluency appears fundamental to the Transformer architecture.
The spelling failures expose several structural limitations that apply well beyond Google's AI Overview.
LLMs are pattern matchers, not symbol manipulators. Counting letters is a trivial algorithmic task for any computer running traditional code, but LLMs don't execute algorithms; they predict the next most probable token based on statistical patterns in their training data. When asked for a letter count, the model generates a probable-sounding answer from learned associations, not a counting operation.
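For contrast, this is roughly what the "trivial algorithmic task" looks like in traditional code. It is a minimal sketch; the function names are illustrative, and treating uppercase and lowercase letters as the same by default is an assumption about what a user asking the question would expect.

```python
from collections import Counter


def count_letter(word, letter, case_sensitive=False):
    """Count occurrences of a single letter in a word, deterministically."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    if not case_sensitive:
        word, letter = word.lower(), letter.lower()
    return word.count(letter)


def letter_histogram(word):
    """Case-insensitive count of every character in the word."""
    return Counter(word.lower())


if __name__ == "__main__":
    print(count_letter("Google", "p"))  # 0
    print(count_letter("poop", "r"))    # 0
    print(letter_histogram("Google"))
```

The code operates on the characters themselves, so the answer is the same on every run and has nothing to do with how likely an answer sounds.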
Confidence has no relationship to correctness. The AI volunteered "two" with perfect grammatical fluency yet was objectively wrong. This is a hallmark of LLM hallucination: confident, plausible-sounding outputs with no built-in verification mechanism. Google itself acknowledged in 2024 that while AI Overviews are "built to only show information that is backed up by top web results," they can still misinterpret queries or nuances in language.
The blind spot is architectural, not incidental. Every major LLM using subword tokenization, including models from OpenAI, Anthropic, and Meta, exhibits similar weaknesses on character-level tasks like spelling words backwards, counting letters, or handling anagrams. Scaling models larger helps somewhat, but the bias persists.
These failures may seem embarrassing—an AI that can't spell its own company's name—but the industry doesn't treat them as a crisis because the enormous value of LLMs lies elsewhere.
Fluent text generation, summarization, reasoning, translation, and code generation all come from the model's ability to work at the semantic level, where token-level abstraction is a feature, not a bug. Character-level precision is simply not what these architectures are designed to optimize for.
The practical fix is to route spelling and counting queries to traditional rule-based software rather than asking the LLM to handle them. Several implementations of AI Overviews already attempt to detect and defer such queries, though the highly visible errors in May 2026 demonstrate that detection itself remains imperfect. A separate study found that Google's AI Overviews answer spelling-reversal queries incorrectly 52% of the time, and only 10% of words with three or more syllables were reversed correctly.
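The routing idea can be sketched in a few lines. This is not how Google or any other vendor implements it; the regular expressions, the `route` function, and the `llm` callback are all hypothetical and cover only two simple English phrasings. The narrow coverage is part of the illustration, since queries phrased any other way fall through to the model.

```python
import re

# Hypothetical patterns covering only simple English phrasings.
COUNT_PATTERN = re.compile(
    r"""how many\s+['"]?([a-z])['"]?(?:'?s)?\s+(?:are\s+)?(?:there\s+)?"""
    r"""in\s+(?:the\s+word\s+)?['"]?([a-z]+)""",
    re.IGNORECASE,
)
REVERSE_PATTERN = re.compile(
    r"""spell\s+['"]?([a-z]+)['"]?\s+backwards?""",
    re.IGNORECASE,
)


def route(query, llm=None):
    """Return (handler, answer), answering character-level queries in code."""
    match = COUNT_PATTERN.search(query)
    if match:
        letter, word = match.group(1).lower(), match.group(2).lower()
        return ("count", word.count(letter))
    match = REVERSE_PATTERN.search(query)
    if match:
        return ("reverse", match.group(1)[::-1])
    # Anything else goes to the language model (a stub callback here).
    return ("llm", llm(query) if llm is not None else None)


if __name__ == "__main__":
    print(route('how many p\'s are in "Google"?'))
    print(route("How many r in poop"))
    print(route("spell journalism backwards"))
    print(route("Summarize this article", llm=lambda q: "(model output)"))
```

Even this toy version shows where the difficulty moves: the counting is trivial, but recognizing every way a user might ask for it is not.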
Google is working on fixes for the specific counting issues made public. But for anyone who understands the tokenization trade-off, the real lesson isn't that Google shipped a buggy product. It's that the architecture powering the AI revolution has a fundamental blind spot, and nobody has found a way to fix it without sacrificing what makes LLMs valuable in the first place.