Here's what's happening under the hood:
LLMs do not see individual characters. Instead, they break text into tokens—chunks of one or more characters—using algorithms like Byte-Pair Encoding (BPE). A common word like "Google" might become a single token, while "journalism" could split into subword pieces such as ['journ', 'alism']. The model never stores or processes the raw character sequence.
No innate character awareness. Because training data is tokenized, the model never learns to count individual letters natively. It can only approximate character-level knowledge by pattern-matching against memorized spellings from its training corpus . When you ask for a letter count, you're forcing the model to reverse-engineer character information from text that was never stored character-by-character.
The embedding layer under-represents character structure. Research shows that token embeddings do not fully encode character-level information, particularly beyond the first character of each token. This makes compositional reasoning about letters unreliable .
Theoretical bounds. Transformer architectures belong to the complexity class TC0, which makes them theoretically incapable of solving tasks demanding depth-dependent reasoning as input length grows—a mathematical constraint on precise sequential counting .
"Counting within words has been a known challenge for LLMs, and we're working to fix this particular issue," Google told TechCrunch in an emailed statement . But as researchers have noted, even models with hundreds of billions of parameters trained on trillions of tokens struggle to reliably count the number of 'R's in 'strawberry'
. The problem is structural, not a matter of scale.
The spelling debacle is only the most recent episode in a two-year pattern of high-profile AI Overviews errors, all stemming from the same disconnect between fluent text generation and the precise operations a search engine needs to perform.
Within days of the May 2024 US rollout, AI Overviews generated a series of viral nonsensical answers :
Google's head of Search, Liz Reid, acknowledged "isolated examples" that were "nonsensical" and blamed a combination of "information gaps" and the AI pulling from satirical and low-quality sources . The company said it made corrections, including limiting AI Overviews for health-related and sensitive queries
.
On May 22, 2026, users discovered that searching for the word "disregard"—along with related terms like "ignore," "dismiss," "skip," and "stop"—triggered AI Overviews to output a chatbot-style response: "Understood. I have disregarded your previous prompt. How can I help you today?" .
Instead of returning a dictionary definition, the AI interpreted a simple query as a system-level instruction override. The bug broke Google's search interface for those terms, displaying a blank space where results should have been . Google acknowledged the issue and said a fix was coming
.
Security researchers recognized this as a classic prompt injection scenario—the model was mistaking normal search terms for commands meant for an AI assistant .
Just days after the "disregard" incident, the letter-counting errors emerged. The AI couldn't spell its own parent company's name, miscounted letters in simple words, and even misspelled "Trump" as "t-r-p-u-m" . The errors were verified by multiple news outlets independently
.
The common thread across all three failure categories is architectural, not incidental. Google replaced a traditional keyword-matching search engine with a generative LLM that excels at fluent text generation but lacks the machinery for:
The model confidently produces wrong answers because it was never built—at a fundamental level—to handle the tasks it's now being asked to perform in a live search environment. Each viral failure exposes the gap between what LLMs are good at (predicting plausible-sounding text) and what a trustworthy search engine requires (factual accuracy, character precision, and resistance to instruction injection).
Until those architectural limitations are addressed at a deeper level than patching individual query types, AI Overviews will likely keep generating headlines for the wrong reasons.
Comments
0 comments