KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization

Abstract

This paper presents KIT's submissions to the IWSLT 2025 low-resource track. We develop both cascaded systems, consisting of Automatic Speech Recognition (ASR) and Machine Translation (MT) models, and end-to-end (E2E) Speech Translation (ST) systems for three language pairs: Bemba, North Levantine Arabic, and Tunisian Arabic into English. Building upon pre-trained models, we fine-tune our systems with different strategies to utilize resources efficiently. This study further explores system enhancement with synthetic data and model regularization. Specifically, we investigate MT-augmented ST by generating translations from ASR data using MT models. For North Levantine, which lacks parallel ST training data, a system trained solely on synthetic data slightly surpasses the cascaded system trained on real data. We also explore augmentation using text-to-speech models by generating synthetic speech from MT data, demonstrating the benefits of synthetic data in improving both ASR and ST performance for Bemba. Additionally, we apply intra-distillation to enhance model performance. Our experiments show that this approach consistently improves results across ASR, MT, and ST tasks, as well as across different pre-trained models. Finally, we apply Minimum Bayes Risk decoding to combine the cascaded and end-to-end systems, achieving an improvement of approximately 1.5 BLEU points.
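
The system-combination step mentioned at the end of the abstract can be illustrated with a minimal sketch of Minimum Bayes Risk (MBR) selection. It pools hypotheses from several systems and picks the one that agrees most, on average, with the others. The utility function here (unigram F1) and the names `unigram_f1` and `mbr_select` are assumptions made for illustration; the abstract does not state which utility metric or candidate pool the authors use.

```python
from collections import Counter
from typing import Callable, Sequence


def unigram_f1(hyp: str, ref: str) -> float:
    """Toy utility: F1 over whitespace tokens (an assumed stand-in metric)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())  # clipped token matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(
    candidates: Sequence[str],
    utility: Callable[[str, str], float] = unigram_f1,
) -> str:
    """Return the candidate with the highest mean utility against all others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    best, best_score = candidates[0], -1.0
    for i, hyp in enumerate(candidates):
        # Every other candidate acts as a pseudo-reference.
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(utility(hyp, ref) for ref in others) / len(others) if others else 0.0
        if score > best_score:
            best, best_score = hyp, score
    return best


# Hypothetical outputs pooled from a cascaded and an end-to-end system.
pool = ["the cat sat on the mat", "the cat sat on a mat", "a dog ran away"]
print(mbr_select(pool))
```

In practice the utility would typically be a sentence-level translation metric rather than unigram F1, and the pool would hold several hypotheses per system.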

Authors (9)

Zhaolin Li

Yining Liu

Danni Liu

Tuan Nam Nguyen

Enes Yavuz Ugan

Tu Anh Dinh

+3 more

Submitted

May 26, 2025

arXiv Category

cs.CL

Key Contributions

This paper presents KIT's low-resource speech translation systems for IWSLT 2025, covering cascaded and end-to-end systems for three language pairs: Bemba, North Levantine Arabic, and Tunisian Arabic into English. Its main contributions are the use of synthetic data (MT-augmented ST and TTS-generated speech) and model regularization through intra-distillation to improve performance when training data is scarce.
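
As a rough illustration of MT-augmented ST, the sketch below turns ASR pairs (audio, transcript) into synthetic ST pairs (audio, translation) by passing each transcript through an MT system. The data classes, the `mt_augment` function, and the `translate` callable are hypothetical names introduced here; the authors' actual pipeline, models, and filtering are not described in this summary.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class AsrExample:
    audio_path: str
    transcript: str  # source-language transcript


@dataclass
class StExample:
    audio_path: str
    translation: str  # target-language text
    synthetic: bool = True  # marks MT-generated targets


def mt_augment(
    asr_data: Iterable[AsrExample],
    translate: Callable[[str], str],
) -> List[StExample]:
    """Pair each audio file with an MT translation of its transcript."""
    st_data = []
    for ex in asr_data:
        translation = translate(ex.transcript).strip()
        if not translation:
            continue  # drop examples where MT returned nothing usable
        st_data.append(StExample(ex.audio_path, translation))
    return st_data


# Toy stand-in for an MT model (assumption): upper-cases the transcript.
toy_mt = lambda text: text.upper()
pairs = mt_augment([AsrExample("utt1.wav", "hello world")], toy_mt)
print(pairs)
```

The resulting pairs could then be mixed with any real ST data when fine-tuning an end-to-end model; the `synthetic` flag keeps the two sources distinguishable.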

Business Value

Enables more effective communication and information access for speakers of low-resource languages, opening up new markets and user bases for translation and speech technology providers.

Paper Metadata

Innovation Type

Methodological Improvement

Deployment Feasibility

Moderate. Requires pre-trained models and computational resources for fine-tuning and synthetic data generation, but the techniques are applicable to existing ST pipelines.

Limitations Addressed

Data scarcity in low-resource languages, Improving performance of ST systems with limited parallel data

Technical Tags

Speech Translation, Automatic Speech Recognition, Machine Translation, Low-resource languages, Synthetic data generation, Model regularization, Cascaded systems, End-to-end systems, Fine-tuning, Pre-trained models

Research Topics

Low-resource Speech Translation, Data Augmentation, Model Adaptation, Cross-lingual Transfer, Speech Processing

Methods & Architectures

Fine-tuning, MT-augmented ST, Text-to-speech synthesis, ASR models, MT models, End-to-end ST models

Applications & Tasks

Machine Translation, Speech Processing, Low-resource NLP, Low-resource translation, Data scarcity, Model generalization, Speech Translation, Automatic Speech Recognition

Datasets & Benchmarks

Datasets

IWSLT 2025

Related Fields

Natural Language Processing, Computational Linguistics, Speech Technology

Keywords

Speech Translation, Low-resource, IWSLT, ASR, MT, Synthetic Data, Model Regularization, Cascaded Systems, End-to-End Systems, Fine-tuning, Pre-trained Models, Language Pairs, Arabic, Bemba

Academic Context

KIT: Low-resource Speech Translation, Data Augmentation, Model Adaptation, Cross-lingual Transfer, Speech Processing

Companies & Organizations

Research Institutions

KIT (Karlsruhe Institute of Technology)

Commercial Potential

Potential Products

Low-resource speech translation services, Multilingual communication tools

Target Industries

Telecommunications, Media, Global Business, Education

Use Case Examples

Real-time translation of spoken content in under-represented languages; enabling cross-lingual communication for humanitarian aid workers

Competitive Edge

Focuses on improving performance for low-resource languages through synthetic data and regularization, potentially outperforming generic models in these specific domains.

Resource Requirements

Compute Needs

Moderate to High (for fine-tuning and synthetic data generation)

Data Requirements

Low-resource parallel speech and text data, pre-trained models

Deployment Constraints

Performance may vary significantly based on the specific low-resource language pair and availability of related pre-trained models.

Scalability

Scalable to new low-resource language pairs with appropriate data and pre-trained models.
