
Pretraining Strategies using Monolingual and Parallel Data for Low-Resource Machine Translation

Abstract

This research article examines the effectiveness of various pretraining strategies for developing machine translation models tailored to low-resource languages. Although this work considers several low-resource languages, including Afrikaans, Swahili, and Zulu, the translation model is specifically developed for Lingala, an under-resourced African language, building upon the pretraining approach introduced by Reid and Artetxe (2021), originally designed for high-resource languages. Through a series of comprehensive experiments, we explore different pretraining methodologies, including the integration of multiple languages and the use of both monolingual and parallel data during the pretraining phase. Our findings indicate that pretraining on multiple languages and leveraging both monolingual and parallel data significantly enhance translation quality. This study offers valuable insights into effective pretraining strategies for low-resource machine translation, helping to bridge the performance gap between high-resource and low-resource languages. The results contribute to the broader goal of developing more inclusive and accurate NLP models for marginalized communities and underrepresented populations. The code and datasets used in this study are publicly available to facilitate further research and ensure reproducibility, with the exception of certain data that may no longer be accessible due to changes in public availability.
Authors: Idriss Nguepi Nguefack, Mara Finkelstein, Toadoum Sari Sakayo
Submitted: October 29, 2025
arXiv Category: cs.CL
Venue: ACL 2025

Key Contributions

Demonstrates the effectiveness of pretraining strategies that use both monolingual and parallel data for low-resource machine translation, showing significant quality improvements for Lingala through multilingual pretraining, as sketched below.
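As a rough illustration of what "pretraining on both monolingual and parallel data" can look like in practice, the sketch below interleaves denoising examples built from monolingual text with translation examples built from parallel pairs, feeding both to a single seq2seq objective in the spirit of Reid and Artetxe (2021). Everything here is an assumption for illustration, not the paper's implementation: the toy corpora, the masking rate, the `<2fr>` target-language tag, and the 50/50 mixing ratio are all placeholders.

```python
import random

# Hypothetical toy corpora (placeholders); the paper's experiments use
# real monolingual text and parallel data for Lingala and other languages.
monolingual = ["mbote na yo", "nalingi kotanga mikanda"]  # Lingala text
parallel = [("mbote", "bonjour"), ("matondo mingi", "merci beaucoup")]  # ln-fr pairs

MASK = "<mask>"

def denoising_example(sentence: str, mask_prob: float = 0.35):
    """BART-style token masking on monolingual text: the model must
    reconstruct the original sentence from the corrupted input."""
    tokens = sentence.split()
    corrupted = [MASK if random.random() < mask_prob else t for t in tokens]
    return " ".join(corrupted), sentence

def translation_example(src: str, tgt: str):
    """Parallel data cast as the same seq2seq task; the <2fr> tag is an
    assumed convention marking French as the target language."""
    return f"<2fr> {src}", tgt

def pretraining_stream(n: int, parallel_ratio: float = 0.5):
    """Interleave the two example types so one model is pretrained on
    monolingual and parallel data simultaneously."""
    for _ in range(n):
        if random.random() < parallel_ratio:
            yield translation_example(*random.choice(parallel))
        else:
            yield denoising_example(random.choice(monolingual))

if __name__ == "__main__":
    for source, target in pretraining_stream(4):
        print(f"input:  {source}\ntarget: {target}\n")
```

In a real setup the mixing ratio and the set of pretraining languages would be tuned; the paper's finding is that including multiple languages and both data types in this phase improves downstream translation quality for Lingala.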

Business Value

Enables the development of more accessible and affordable translation services for underrepresented languages, fostering global communication and access to information.