Abstract
Automatic Speech Recognition (ASR) systems, despite large-scale multilingual
training, struggle in out-of-domain and low-resource scenarios where labeled
data is scarce. We propose BEARD (BEST-RQ Encoder Adaptation with Re-training
and Distillation), a novel framework designed to adapt Whisper's encoder using
unlabeled data. Unlike traditional self-supervised learning methods, BEARD
uniquely combines a BEST-RQ objective with knowledge distillation from a frozen
teacher encoder, ensuring the encoder's complementarity with the pre-trained
decoder. Our experiments focus on the ATCO2 corpus from the challenging Air
Traffic Control (ATC) communications domain, characterized by non-native
speech, noise, and specialized phraseology. Using about 5,000 hours of
untranscribed speech for BEARD and 2 hours of transcribed speech for
fine-tuning, the proposed approach significantly outperforms both the previous
baseline and a directly fine-tuned model, achieving a 12% relative improvement
over the latter. To the best of our knowledge, this is the first work to use a
self-supervised learning objective for domain adaptation of Whisper.
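To make the training objective concrete, below is a minimal PyTorch sketch of how a BEST-RQ-style masked-prediction loss can be combined with distillation from a frozen teacher encoder. It assumes frame-synchronous student and teacher encoders and an MSE distillation term; the quantizer dimensions, the prediction head, and the loss weight `lam` are illustrative placeholders, not the paper's actual choices.

```python
# Minimal sketch of a BEARD-style objective: a BEST-RQ masked-prediction
# loss plus distillation from the frozen original encoder. All shapes and
# hyperparameters (proj_dim, codebook_size, lam) are illustrative assumptions.
import torch
import torch.nn.functional as F


class RandomProjectionQuantizer(torch.nn.Module):
    """Frozen random projection + codebook that maps input frames to
    discrete target labels, as in BEST-RQ."""

    def __init__(self, feat_dim=80, proj_dim=16, codebook_size=8192):
        super().__init__()
        # Buffers are never trained; BEST-RQ keeps the quantizer fixed.
        self.register_buffer("proj", torch.randn(feat_dim, proj_dim))
        self.register_buffer(
            "codebook", F.normalize(torch.randn(codebook_size, proj_dim), dim=-1)
        )

    @torch.no_grad()
    def forward(self, feats):  # feats: (batch, time, feat_dim)
        z = F.normalize(feats @ self.proj, dim=-1)
        # Nearest codebook entry per frame (argmax of cosine similarity,
        # equivalent to nearest Euclidean neighbor on unit vectors).
        return (z @ self.codebook.t()).argmax(dim=-1)  # (batch, time)


def beard_loss(student_enc, teacher_enc, head, quantizer, feats, mask, lam=1.0):
    """Combined loss, assuming frame-synchronous encoders; `lam` is an
    assumed weighting between the two terms."""
    targets = quantizer(feats)                             # (B, T)
    h_student = student_enc(feats.masked_fill(mask.unsqueeze(-1), 0.0))
    logits = head(h_student)                               # (B, T, codebook_size)
    # BEST-RQ: predict the quantizer labels, but only on masked frames.
    ce = F.cross_entropy(logits[mask], targets[mask])
    with torch.no_grad():
        h_teacher = teacher_enc(feats)                     # frozen teacher encoder
    distill = F.mse_loss(h_student, h_teacher)             # stay close to teacher
    return ce + lam * distill
```

In this reading, the distillation term keeps the adapted encoder's output space aligned with what the frozen pre-trained decoder expects, which is the complementarity the abstract refers to.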
Authors (3)
Raphaël Bagat
Irina Illina
Emmanuel Vincent
Submitted
October 28, 2025
Key Contributions
Proposes BEARD, a novel framework for adapting the encoder of the Whisper ASR model using unlabeled data via a BEST-RQ objective and knowledge distillation. This approach effectively handles out-of-domain and low-resource scenarios, achieving significant improvements on the challenging ATCO2 corpus with limited transcribed data.
Business Value
Improves accuracy of speech recognition in specialized, noisy, and low-resource domains like aviation, leading to better communication tools and safety.