
Testing Cross-Lingual Text Comprehension In LLMs Using Next Sentence Prediction

Abstract

While large language models are trained on massive datasets, this data is heavily skewed towards English. Does their impressive performance reflect genuine ability or just this data advantage? To find out, we tested them in a setting where they could not rely on data abundance: low-resource languages. Building on prior work (Agarwal et al., 2025) that used Next Sentence Prediction (NSP) as a test, we created a large-scale benchmark with 10,000 questions each for English (a high-resource language), Swahili (medium-resource), and Hausa (low-resource). We then tested several top models, including GPT-4 Turbo, Gemini 1.5 Flash, and LLaMA 3 70B, to see how their performance holds up. The results painted a clear picture of how language resource levels shape outcomes. While all models excelled in English, their accuracy dropped in Swahili and fell sharply in Hausa, with LLaMA 3 struggling the most. The story became even more interesting when we introduced Chain-of-Thought (CoT) prompting. For the struggling LLaMA 3, CoT acted as a helpful guide, significantly boosting its accuracy. However, for the more capable GPT-4 and Gemini, the same technique often backfired, leading to a kind of "overthinking" that hurt their results in the cross-lingual context. This reveals that Chain-of-Thought is not a universal solution; its effectiveness depends heavily on the model's baseline capability and the specific context of the task. Our framework pinpoints LLM weaknesses, highlights when CoT helps or hinders cross-lingual NSP performance, and identifies factors influencing model decisions.
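To make the evaluation setup concrete, here is a minimal sketch of an NSP benchmark loop. This is an illustration only, not the authors' exact protocol: the prompt wording, the two-candidate format, and the `judge` callable (which would wrap an LLM API call in practice, here replaced by a stub) are all assumptions.

```python
# Minimal sketch of a Next Sentence Prediction (NSP) evaluation loop.
# `judge` stands in for an LLM call; the prompt format is an assumption,
# not the paper's exact wording.
from dataclasses import dataclass


@dataclass
class NSPItem:
    context: str        # the preceding sentence
    candidates: list    # two candidate next sentences
    answer: int         # index (0 or 1) of the true next sentence


def build_prompt(item: NSPItem, use_cot: bool = False) -> str:
    """Format one NSP question; optionally ask for step-by-step reasoning (CoT)."""
    prompt = (
        f"Context: {item.context}\n"
        f"A: {item.candidates[0]}\n"
        f"B: {item.candidates[1]}\n"
        "Which sentence most plausibly follows the context? Answer A or B."
    )
    if use_cot:
        prompt += " Think step by step before giving your final answer."
    return prompt


def evaluate(items, judge, use_cot: bool = False) -> float:
    """Accuracy of `judge` (a callable: prompt -> 'A' or 'B') on the benchmark."""
    correct = sum(
        judge(build_prompt(item, use_cot)) == "AB"[item.answer]
        for item in items
    )
    return correct / len(items)


# Toy usage with a trivial judge that always answers "A":
items = [
    NSPItem("The rains began in March.",
            ["The fields turned green.", "The stock market closed."], 0),
    NSPItem("She boarded the bus.",
            ["It pulled away from the stop.", "The oven preheated."], 0),
]
print(evaluate(items, lambda prompt: "A"))  # → 1.0
```

Running the same `evaluate` call with `use_cot=True` and a real model-backed `judge` per language would reproduce the paper's comparison of baseline versus CoT accuracy across English, Swahili, and Hausa.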
Authors (2)
Ritesh Sunil Chavan
Jack Mostow
Submitted
October 29, 2025
arXiv Category
cs.CL
arXiv PDF

Key Contributions

Develops a large-scale benchmark with 10,000 questions each for English, Swahili, and Hausa to test cross-lingual text comprehension in LLMs using Next Sentence Prediction. The study reveals significant performance drops in lower-resource languages, highlighting the impact of data bias.

Business Value

Provides crucial evidence for the limitations of current LLMs in truly understanding and processing low-resource languages, guiding efforts to develop more equitable and globally capable AI systems.