Abstract
Automatic Speech Recognition (ASR) systems, despite large-scale multilingual
training, struggle in out-of-domain and low-resource scenarios where labeled
data is scarce. We propose BEARD (BEST-RQ Encoder Adaptation with Re-training
and Distillation), a novel framework designed to adapt Whisper's encoder using
unlabeled data. Unlike traditional self-supervised learning methods, BEARD
uniquely combines a BEST-RQ objective with knowledge distillation from a frozen
teacher encoder, ensuring the encoder's complementarity with the pre-trained
decoder. Our experiments focus on the ATCO2 corpus from the challenging Air
Traffic Control (ATC) communications domain, characterized by non-native
speech, noise, and specialized phraseology. Using about 5,000 hours of
untranscribed speech for BEARD and 2 hours of transcribed speech for
fine-tuning, the proposed approach significantly outperforms both the previous
baseline and a directly fine-tuned model, achieving a 12% relative improvement
over the latter. To the best of our knowledge, this is the first work to use a
self-supervised learning objective for domain adaptation of Whisper.
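To make the training objective concrete, below is a minimal PyTorch sketch of how a BEST-RQ-style masked-prediction loss can be combined with distillation from a frozen teacher encoder. It assumes frame-synchronous student and teacher encoders and an MSE distillation term; the quantizer dimensions, the prediction head, and the loss weight `lam` are illustrative placeholders, not the paper's actual choices.

```python
# Minimal sketch of a BEARD-style objective: a BEST-RQ masked-prediction
# loss plus distillation from the frozen original encoder. All shapes and
# hyperparameters (proj_dim, codebook_size, lam) are illustrative assumptions.
import torch
import torch.nn.functional as F


class RandomProjectionQuantizer(torch.nn.Module):
    """Frozen random projection + codebook that maps input frames to
    discrete target labels, as in BEST-RQ."""

    def __init__(self, feat_dim=80, proj_dim=16, codebook_size=8192):
        super().__init__()
        # Buffers are never trained; BEST-RQ keeps the quantizer fixed.
        self.register_buffer("proj", torch.randn(feat_dim, proj_dim))
        self.register_buffer(
            "codebook", F.normalize(torch.randn(codebook_size, proj_dim), dim=-1)
        )

    @torch.no_grad()
    def forward(self, feats):  # feats: (batch, time, feat_dim)
        z = F.normalize(feats @ self.proj, dim=-1)
        # Nearest codebook entry per frame (argmax of cosine similarity,
        # equivalent to nearest Euclidean neighbor on unit vectors).
        return (z @ self.codebook.t()).argmax(dim=-1)  # (batch, time)


def beard_loss(student_enc, teacher_enc, head, quantizer, feats, mask, lam=1.0):
    """Combined loss, assuming frame-synchronous encoders; `lam` is an
    assumed weighting between the two terms."""
    targets = quantizer(feats)                             # (B, T)
    h_student = student_enc(feats.masked_fill(mask.unsqueeze(-1), 0.0))
    logits = head(h_student)                               # (B, T, codebook_size)
    # BEST-RQ: predict the quantizer labels, but only on masked frames.
    ce = F.cross_entropy(logits[mask], targets[mask])
    with torch.no_grad():
        h_teacher = teacher_enc(feats)                     # frozen teacher encoder
    distill = F.mse_loss(h_student, h_teacher)             # stay close to teacher
    return ce + lam * distill
```

In this reading, the distillation term keeps the adapted encoder's output space aligned with what the frozen pre-trained decoder expects, which is the complementarity the abstract refers to.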
Authors (3)
Raphaël Bagat
Irina Illina
Emmanuel Vincent
Submitted
October 28, 2025
Key Contributions
Proposes BEARD, a novel framework for adapting the encoder of the Whisper ASR model using unlabeled data via a BEST-RQ objective and knowledge distillation. This approach effectively handles out-of-domain and low-resource scenarios, achieving significant improvements on the challenging ATCO2 corpus with limited transcribed data.
Business Value
Improves accuracy of speech recognition in specialized, noisy, and low-resource domains like aviation, leading to better communication tools and safety.