ChiKhaPo: A Large-Scale Multilingual Benchmark for Evaluating Lexical Comprehension and Generation in Large Language Models

Abstract

Existing benchmarks for large language models (LLMs) are largely restricted to high- or mid-resource languages, and often evaluate performance on higher-order tasks in reasoning and generation. However, plenty of evidence points to the fact that LLMs lack basic linguistic competence in the vast majority of the world's 3800+ written languages. We introduce ChiKhaPo, consisting of 8 subtasks of varying difficulty designed to evaluate the lexical comprehension and generation abilities of generative models. ChiKhaPo draws on existing lexicons, monolingual data, and bitext, and provides coverage for 2700+ languages for 2 subtasks, surpassing any existing benchmark in terms of language coverage. We further show that 6 SOTA models struggle on our benchmark, and discuss the factors contributing to performance scores, including language family, language resourcedness, task, and comprehension versus generation directions. With ChiKhaPo, we hope to enable and encourage the massively multilingual benchmarking of LLMs.
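To make the idea of lexicon-based evaluation concrete, here is a minimal sketch of how a word-level comprehension score could be computed against a bilingual lexicon. It is an illustration only, not the paper's scoring procedure: the function name `lexicon_accuracy`, the exact-match criterion, the case and whitespace normalization, and the toy Spanish-English data are all assumptions made for this example.

```python
from typing import Dict, Set


def lexicon_accuracy(
    predictions: Dict[str, str],
    lexicon: Dict[str, Set[str]],
) -> float:
    """Fraction of source words whose predicted translation is listed in the lexicon.

    Hypothetical helper: exact match after lowercasing and stripping whitespace.
    Source words missing from the lexicon are skipped rather than counted as errors.
    """
    scored = [word for word in predictions if word in lexicon]
    if not scored:
        return 0.0
    hits = 0
    for word in scored:
        accepted = {t.strip().lower() for t in lexicon[word]}
        if predictions[word].strip().lower() in accepted:
            hits += 1
    return hits / len(scored)


# Toy data (assumed for illustration): source word -> accepted English translations.
toy_lexicon = {
    "gato": {"cat"},
    "perro": {"dog", "hound"},
    "casa": {"house", "home"},
}
# Toy model outputs: source word -> the model's single predicted translation.
toy_predictions = {"gato": "cat", "perro": "Dog ", "casa": "car"}

score = lexicon_accuracy(toy_predictions, toy_lexicon)
print(f"lexicon accuracy: {score:.3f}")
```

A lexicon that maps each source word to a set of accepted translations lets several valid answers count as correct, which matters when a word has more than one common equivalent.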

Authors (2)

Emily Chang

Niyati Bafna

Submitted

October 19, 2025

arXiv Category

cs.CL

Key Contributions

Introduces ChiKhaPo, a large-scale multilingual benchmark of 8 subtasks for evaluating lexical comprehension and generation in LLMs, with coverage of 2700+ languages for 2 of the subtasks. The benchmark addresses the gap in evaluating LLMs on low-resource languages, where models often lack basic linguistic competence.

Business Value

Enables development of LLMs that are more equitable and functional across a much wider range of global languages, opening up new markets and applications for AI in diverse linguistic communities.

Paper Metadata

Innovation Type

New Benchmark/Dataset

Deployment Feasibility

High, as it provides a standardized way to evaluate LLMs for multilingual capabilities.

Limitations Addressed

Existing benchmarks are limited to high/mid-resource languages and focus on higher-order tasks, neglecting basic linguistic competence in the vast majority of languages.

Technical Tags

multilingual LLMs, lexical comprehension, lexical generation, low-resource languages, benchmark, language coverage, linguistic competence, natural language processing

Research Topics

LLM Evaluation, Multilingual NLP, Low-Resource Languages, Linguistic Theory, Model Robustness

Methods & Architectures

benchmark creation, evaluation framework, Large Language Models (LLMs)

Applications & Tasks

Natural Language Processing, Computational Linguistics, Evaluating LLM capabilities, Addressing language bias in LLMs, Assessing linguistic competence, Lexical comprehension, Lexical generation, Multilingual NLP tasks

Datasets & Benchmarks

Datasets

ChiKhaPo

performance scores

Related Fields

Computational Linguistics, Natural Language Processing, Machine Learning, Linguistics

Keywords

LLM, benchmark, multilingual, low-resource languages, lexical, comprehension, generation, evaluation, linguistic competence, natural language processing, NLP, language coverage

Academic Context

#LLM Evaluation, #Multilingual NLP, #Low-Resource Languages, #Linguistic Theory, #Model Robustness

Commercial Potential

Potential Products

Multilingual LLM evaluation tools, LLMs with improved low-resource language support

Target Industries

Technology, Publishing, Education, Global Communications

Use Case Examples

Evaluating translation models, Developing LLMs for underrepresented languages, Assessing LLMs' understanding of word meanings and usage

Competitive Edge

Offers broader language coverage than existing benchmarks, focusing specifically on foundational lexical abilities.

Market Opportunity

Growing demand for truly multilingual AI.

Revenue Models

N/A (benchmark)

Resource Requirements

Compute Needs

Moderate (for running evaluations)

Data Requirements

Access to lexicons, monolingual data, and bitext for 2700+ languages.

Deployment Constraints

Requires careful curation and maintenance of multilingual data.

Scalability

The benchmark itself is designed to be scalable to many languages.

Production Readiness

Maturity Level

Research/Development

Time to Market

N/A (benchmark)

Licensing

Likely open-source for the benchmark data and evaluation scripts.

Patent Potential

Low
