📄 Abstract
Existing benchmarks for large language models (LLMs) are largely restricted
to high- or mid-resource languages, and often evaluate performance on
higher-order tasks in reasoning and generation. However, plenty of evidence
points to the fact that LLMs lack basic linguistic competence in the vast
majority of the world's 3800+ written languages. We introduce ChiKhaPo,
consisting of 8 subtasks of varying difficulty designed to evaluate the lexical
comprehension and generation abilities of generative models. ChiKhaPo draws on
existing lexicons, monolingual data, and bitext, and provides coverage for
2700+ languages for 2 subtasks, surpassing any existing benchmark in terms of
language coverage. We further show that 6 SOTA models struggle on our
benchmark, and discuss the factors contributing to performance scores,
including language family, language resourcedness, task, and comprehension
versus generation directions. With ChiKhaPo, we hope to enable and encourage
the massively multilingual benchmarking of LLMs.
Submitted: October 19, 2025
Key Contributions
Introduces ChiKhaPo, a massively multilingual benchmark of 8 subtasks that evaluates lexical comprehension and generation in LLMs, with coverage of 2700+ languages on 2 of those subtasks. The benchmark addresses a critical gap in evaluating LLMs on low-resource languages, and the authors report that 6 SOTA models struggle on it.
Business Value
Supports the development of LLMs that are more equitable and functional across a much wider range of languages, which could open up new markets and applications for AI in diverse linguistic communities.