
Typoglycemia under the Hood: Investigating Language Models' Understanding of Scrambled Words

Abstract

Research in linguistics has shown that humans can read words with internally scrambled letters, a phenomenon recently dubbed typoglycemia. Several NLP models have recently been proposed that are similarly robust to such distortions, ignoring the internal order of characters by design. This raises a fundamental question: how can models perform well when many distinct words (e.g., form and from) collapse into identical representations under typoglycemia? Our work, focusing exclusively on the English language, seeks to shed light on the factors responsible for this robustness. We hypothesize that (i) relatively few English words collapse under typoglycemia, and that (ii) collapsed words tend to occur in contexts so distinct that disambiguation becomes trivial. In our analysis, we (i) analyze the British National Corpus to quantify word collapse and ambiguity under typoglycemia, (ii) evaluate BERT's ability to disambiguate collapsing forms, and (iii) conduct a probing experiment comparing variants of BERT trained from scratch on clean versus typoglycemic Wikipedia text; our results reveal that the performance degradation caused by scrambling is smaller than expected.
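
To make the collapse phenomenon concrete, the sketch below (ours, not the authors' code) computes a scrambling-invariant signature for each word: first letter, sorted interior letters, last letter. Words sharing a signature, such as form and from, become indistinguishable once interior order is discarded. The helper name collapse_key and the toy vocabulary are hypothetical; the paper performs this kind of counting over the British National Corpus.

```python
from collections import defaultdict

def collapse_key(word: str) -> str:
    """Scrambling-invariant signature: first letter + sorted interior + last letter."""
    if len(word) <= 3:          # too short to scramble; the word is its own key
        return word
    return word[0] + "".join(sorted(word[1:-1])) + word[-1]

# Hypothetical toy vocabulary for illustration only.
vocab = ["form", "from", "salt", "slat", "there", "three", "reading", "language"]

groups = defaultdict(list)
for w in vocab:
    groups[collapse_key(w)].append(w)

# Words sharing a key are indistinguishable under typoglycemia.
collisions = {k: ws for k, ws in groups.items() if len(ws) > 1}
print(collisions)
# {'form': ['form', 'from'], 'salt': ['salt', 'slat'], 'tehre': ['there', 'three']}
```

Counting how many such collision classes exist, and how frequent their members are, gives a direct measure of how much ambiguity scrambling actually introduces.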
Authors: Gianluca Sperduti, Alejandro Moreo
Submitted: October 24, 2025
arXiv Category: cs.CL

Key Contributions

This paper investigates typoglycemia in language models, specifically BERT. It analyzes the British National Corpus to quantify how often words collapse and become ambiguous under character scrambling, evaluates BERT's ability to disambiguate collapsing forms, and compares BERT variants trained from scratch on clean versus typoglycemic Wikipedia text. The central hypothesis is that robustness stems from the low frequency of word collapse and from the distinct contexts in which collapsed words appear.
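
For the probing experiment, a scrambling procedure is needed to produce typoglycemic training text. Below is a minimal sketch, assuming interior letters are shuffled uniformly at random while the first and last letters stay fixed; the function name typoglycemize and the decision to leave words of three or fewer letters untouched are our assumptions, as the paper's exact preprocessing is not described here.

```python
import random

def typoglycemize(word: str, rng: random.Random) -> str:
    """Shuffle a word's interior letters; keep the first and last letters fixed."""
    if len(word) <= 3:                 # no interior to shuffle for short words
        return word
    interior = list(word[1:-1])
    rng.shuffle(interior)
    return word[0] + "".join(interior) + word[-1]

rng = random.Random(42)                # fixed seed for a reproducible example
sentence = "research shows that humans can read scrambled words"
print(" ".join(typoglycemize(w, rng) for w in sentence.split()))
```

Note that a random shuffle can return the original order, especially for short interiors, so some words pass through unscrambled.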

Business Value

Improves understanding of LLM robustness and its limits, supporting more reliable text-processing systems and offering insights into human reading mechanisms.