📄 Abstract
This paper presents KIT's submissions to the IWSLT 2025 low-resource track.
We develop both cascaded systems, consisting of Automatic Speech Recognition
(ASR) and Machine Translation (MT) models, and end-to-end (E2E) Speech
Translation (ST) systems for three language pairs: Bemba, North Levantine
Arabic, and Tunisian Arabic into English. Building upon pre-trained models, we
fine-tune our systems with different strategies to utilize resources
efficiently. This study further explores system enhancement with synthetic data
and model regularization. Specifically, we investigate MT-augmented ST by
generating translations from ASR data using MT models. For North Levantine,
which lacks parallel ST training data, a system trained solely on synthetic
data slightly surpasses the cascaded system trained on real data. We also
explore augmentation using text-to-speech models by generating synthetic speech
from MT data, demonstrating the benefits of synthetic data in improving both
ASR and ST performance for Bemba. Additionally, we apply intra-distillation to
enhance model performance. Our experiments show that this approach consistently
improves results across ASR, MT, and ST tasks, as well as across different
pre-trained models. Finally, we apply Minimum Bayes Risk decoding to combine
the cascaded and end-to-end systems, achieving an improvement of approximately
1.5 BLEU points.
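
The Minimum Bayes Risk combination mentioned in the abstract can be illustrated with a minimal sketch. It pools candidate translations from both systems and picks the one that agrees most with the others on average. The abstract does not say which utility metric or candidate pool the authors use, so everything here is an assumption for illustration: the `token_f1` utility is a simple stand-in for a real metric such as BLEU, and the function names and example sentences are hypothetical.

```python
from collections import Counter


def token_f1(hyp: str, ref: str) -> float:
    """Unigram F1 overlap between two whitespace-tokenized sentences.

    Assumption: a toy utility standing in for whatever metric the paper uses.
    """
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, utility=token_f1):
    """Return the candidate with the highest average utility against the rest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    best_index, best_score = 0, float("-inf")
    for i, hyp in enumerate(candidates):
        # Every other candidate acts as a pseudo-reference for this hypothesis.
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(utility(hyp, o) for o in others) / len(others) if others else 0.0
        if score > best_score:
            best_index, best_score = i, score
    return candidates[best_index]


# Hypothetical candidates from the two system types.
cascaded = ["the market opens at nine", "the market is open at nine"]
end_to_end = ["the market opens at nine today", "a shop opens at noon"]
pool = cascaded + end_to_end
print(mbr_select(pool))
```

Because the selection depends only on agreement among candidates, the same function works whether the pool comes from one system or several, which is what makes MBR a convenient way to combine cascaded and end-to-end outputs.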
Authors (9)
Zhaolin Li
Yining Liu
Danni Liu
Tuan Nam Nguyen
Enes Yavuz Ugan
Tu Anh Dinh
+3 more
Key Contributions
This paper presents KIT's low-resource speech translation systems for IWSLT 2025, covering cascaded and end-to-end systems for three language pairs: Bemba, North Levantine Arabic, and Tunisian Arabic into English. The main contributions are the use of synthetic data (MT-augmented ST and TTS-generated speech) and model regularization through intra-distillation to improve performance in data-scarce scenarios.
Business Value
Better speech translation for low-resource languages enables more effective communication and information access for their speakers, and opens up new markets and user bases for translation and speech technology providers.