Redirecting to original paper in 30 seconds...
Click below to go immediately or wait for automatic redirect
📄 Abstract
Abstract: The linguistic diversity of India poses significant machine translation
challenges, especially for underrepresented tribal languages like Bhili, which
lack high-quality linguistic resources. This paper addresses the gap by
introducing Bhili-Hindi-English Parallel Corpus (BHEPC), the first and largest
parallel corpus worldwide comprising 110,000 meticulously curated sentences
across Bhili, Hindi, and English. The corpus was created with the assistance of
expert human translators. BHEPC spans critical domains such as education,
administration, and news, establishing a valuable benchmark for research in low
resource machine translation. To establish a comprehensive Bhili Machine
Translation benchmark, we evaluated a wide range of proprietary and open-source
Multilingual Large Language Models (MLLMs) on bidirectional translation tasks
between English/Hindi and Bhili. Comprehensive evaluation demonstrates that the
fine-tuned NLLB-200 distilled 600M variant model outperforms others,
highlighting the potential of multilingual models in low resource scenarios.
Furthermore, we investigated the generative translation capabilities of
multilingual LLMs on BHEPC using in-context learning, assessing performance
under cross-domain generalization and quantifying distributional divergence.
This work bridges a critical resource gap and promotes inclusive natural
language processing technologies for low-resource and marginalized languages
globally.
Authors (4)
Pooja Singh
Shashwat Bhardwaj
Vaibhav Sharma
Sandeep Kumar
Submitted
November 1, 2025