📄 Abstract
Research in linguistics has shown that humans can read words whose internal letters have been scrambled, a phenomenon recently dubbed typoglycemia. Several NLP models have recently been proposed that exhibit similar robustness to such distortions, since they ignore the internal order of characters by design. This raises a fundamental question: how can models perform well when many distinct words (e.g., form and from) collapse into identical representations under typoglycemia? Our work, which focuses exclusively on the English language, seeks to shed light on the underlying factors responsible for this robustness. We hypothesize two main reasons: (i) relatively few English words collapse under typoglycemia, and (ii) words that do collapse tend to occur in contexts so distinct that disambiguation becomes trivial. In our analysis, we (i) examine the British National Corpus to quantify word collapse and ambiguity under typoglycemia, (ii) evaluate BERT's ability to disambiguate collapsing forms, and (iii) conduct a probing experiment comparing variants of BERT trained from scratch on clean versus typoglycemic Wikipedia text. Our results reveal that the performance degradation caused by scrambling is smaller than expected.
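The notion of word collapse under typoglycemia can be made concrete with a short sketch. This is an illustrative reconstruction, not the authors' code: it assumes the standard typoglycemia convention in which the first and last letters stay fixed and only interior letters are shuffled, so two words collapse exactly when they share the same first letter, last letter, and multiset of interior letters (as form and from do). The function names (scramble, collapse_key, collapsing_groups) are hypothetical.

```python
import random
from collections import defaultdict

def scramble(word: str, rng: random.Random) -> str:
    """Shuffle a word's interior letters, keeping the first and last letters fixed."""
    if len(word) <= 3:
        return word  # too short to have a scrambleable interior
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def collapse_key(word: str) -> str:
    """Canonical form shared by all words that can scramble into one another:
    first letter + sorted interior letters + last letter."""
    if len(word) <= 3:
        return word
    return word[0] + "".join(sorted(word[1:-1])) + word[-1]

def collapsing_groups(vocab):
    """Group vocabulary words by collapse key; groups with more than one
    member become ambiguous under typoglycemia."""
    groups = defaultdict(set)
    for w in vocab:
        groups[collapse_key(w)].add(w)
    return {k: ws for k, ws in groups.items() if len(ws) > 1}

if __name__ == "__main__":
    rng = random.Random(0)
    print(scramble("typoglycemia", rng))  # interior letters shuffled; 't...a' endpoints fixed
    print(collapsing_groups({"form", "from", "salt", "slat", "cat"}))
    # {'form': {'form', 'from'}, 'salt': {'salt', 'slat'}} -- 'cat' has no collapse partner
```

Applying collapse_key over a corpus vocabulary (e.g., the British National Corpus) and counting groups with more than one member gives a simple estimate of how many distinct word types collapse, which is the kind of quantification the abstract describes.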
Authors (2)
Gianluca Sperduti
Alejandro Moreo
Submitted
October 24, 2025
Key Contributions
This paper investigates the phenomenon of typoglycemia in language models, specifically BERT, by analyzing the British National Corpus to quantify word collapse and ambiguity under character scrambling. It hypothesizes that robustness stems from the low frequency of word collapse and the distinct contexts in which collapsed words appear.
Business Value
Improves the understanding of LLM robustness and limitations, leading to more reliable text processing systems and better insights into human reading mechanisms.