Abstract
This research article examines the effectiveness of various pretraining
strategies for developing machine translation models for low-resource
languages. Although the work considers several low-resource languages,
including Afrikaans, Swahili, and Zulu, the translation model is developed
specifically for Lingala, an under-resourced African language, building on the
pretraining approach introduced by Reid and Artetxe (2021), which was
originally designed for high-resource languages. Through a series of
comprehensive experiments, we explore different pretraining methodologies,
including the integration of multiple languages and the use of both
monolingual and parallel data during the pretraining phase. Our findings
indicate that pretraining on multiple languages and leveraging both
monolingual and parallel data significantly enhance translation quality. This
study offers valuable insights into effective pretraining strategies for
low-resource machine translation, helping to bridge the performance gap
between high-resource and low-resource languages, and contributes to the
broader goal of developing more inclusive and accurate NLP models for
marginalized and underrepresented communities. The code and datasets used in
this study are publicly available to facilitate further research and ensure
reproducibility, except for certain data that may no longer be publicly
accessible.
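
The core idea the abstract describes, pretraining on a mixture of monolingual (denoising) and parallel (translation) examples, can be illustrated with a minimal sketch. The helper names (`make_denoising_example`, `make_translation_example`, `mixed_pretraining_stream`), the token-masking scheme, and the `<2ln>` target-language tag below are illustrative assumptions, not the authors' implementation or the exact objective of Reid and Artetxe (2021).

```python
import random

def make_denoising_example(sentence, mask_token="<mask>", mask_prob=0.35):
    """Monolingual objective: mask random tokens; the model is trained to
    reconstruct the original sentence (a BART/mBART-style denoising task)."""
    tokens = sentence.split()
    noised = [mask_token if random.random() < mask_prob else t for t in tokens]
    return {"source": " ".join(noised), "target": sentence}

def make_translation_example(src_sentence, tgt_sentence, tgt_lang="ln"):
    """Parallel objective: a standard translation pair, prefixed with a
    target-language tag so one model can serve several languages.
    ('ln' is the ISO 639-1 code for Lingala; the tag format is assumed.)"""
    return {"source": f"<2{tgt_lang}> {src_sentence}", "target": tgt_sentence}

def mixed_pretraining_stream(monolingual, parallel, parallel_ratio=0.5):
    """Interleave the two example types so each pretraining batch draws on
    both monolingual and parallel data, per the strategy the paper studies."""
    while True:
        if parallel and random.random() < parallel_ratio:
            src, tgt = random.choice(parallel)
            yield make_translation_example(src, tgt)
        else:
            yield make_denoising_example(random.choice(monolingual))

# Toy usage: Lingala monolingual text plus English-Lingala pairs.
mono = ["mbote na yo", "nalingi yo mingi"]
para = [("hello", "mbote"), ("thank you", "matondo")]
stream = mixed_pretraining_stream(mono, para)
for _ in range(4):
    print(next(stream))
```

In this sketch, `parallel_ratio` controls the balance between the two data sources; the paper's experiments vary which languages and data types enter this mix, but the actual sampling scheme and hyperparameters are not specified in the abstract.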
Authors (3)
Idriss Nguepi Nguefack
Mara Finkelstein
Toadoum Sari Sakayo
Submitted
October 29, 2025
ACL 2025
Key Contributions
Demonstrates the effectiveness of pretraining strategies that combine monolingual and parallel data for low-resource machine translation, showing significant quality improvements for Lingala through multilingual pretraining.
Business Value
Enables the development of more accessible and affordable translation services for underrepresented languages, fostering global communication and access to information.