Abstract
In this work, we address the challenge of building fair English ASR systems for second-language speakers. Our analysis of widely used ASR models, Whisper and SeamlessM4T, reveals large fluctuations in word error rate (WER) across 26 accent groups, indicating significant fairness gaps. To mitigate this, we propose fairness-prompted finetuning with lightweight adapters, incorporating Spectral Decoupling (SD), Group Distributionally Robust Optimization (Group-DRO), and Invariant Risk Minimization (IRM). Our proposed fusion of traditional empirical risk minimization (ERM) with cross-entropy and fairness-driven objectives (SD, Group-DRO, and IRM) enhances fairness across accent groups while maintaining overall recognition accuracy. In terms of macro-averaged word error rate, our approach achieves relative improvements of 58.7% and 58.5% over the large pretrained Whisper and SeamlessM4T models, respectively, and of 9.7% and 7.8% over the same models finetuned with standard ERM using cross-entropy loss.
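The abstract names a fusion of ERM cross-entropy with three published fairness objectives: SD (Pezeshki et al., 2021), Group-DRO (Sagawa et al., 2020), and IRMv1 (Arjovsky et al., 2019). The sketch below is one plausible way to combine them in PyTorch for a classification-style head; token-level ASR logits would be flattened to the same shape. The coefficient values (lam_sd, lam_irm, dro_eta), the function names, and the exponentiated-gradient group-weight update are illustrative assumptions, not the paper's released implementation.

    import torch
    import torch.nn.functional as F

    def sd_penalty(logits):
        # Spectral Decoupling: penalize the squared magnitude of the logits.
        return (logits ** 2).mean()

    def irm_penalty(logits, targets):
        # IRMv1: squared gradient of the risk w.r.t. a dummy scale on the logits.
        scale = torch.ones(1, requires_grad=True, device=logits.device)
        risk = F.cross_entropy(logits * scale, targets)
        grad = torch.autograd.grad(risk, scale, create_graph=True)[0]
        return (grad ** 2).sum()

    def fused_loss(logits, targets, group_ids, group_weights,
                   lam_sd=0.1, lam_irm=1.0, dro_eta=0.01):
        # Per-sample cross-entropy (the ERM term), kept unreduced so that
        # losses can be aggregated per accent group.
        per_sample = F.cross_entropy(logits, targets, reduction="none")
        num_groups = group_weights.numel()
        group_losses = torch.zeros(num_groups, device=logits.device)
        for g in range(num_groups):
            mask = group_ids == g
            if mask.any():
                group_losses[g] = per_sample[mask].mean()
        # Group-DRO: exponentiated-gradient update that upweights the
        # worst-performing groups, then a weighted sum of group losses.
        with torch.no_grad():
            group_weights *= torch.exp(dro_eta * group_losses)
            group_weights /= group_weights.sum()
        dro_loss = (group_weights * group_losses).sum()
        return dro_loss + lam_sd * sd_penalty(logits) + lam_irm * irm_penalty(logits, targets)

    # Toy usage: vocabulary of 10 classes, 4 accent groups, batch of 8.
    logits = torch.randn(8, 10, requires_grad=True)
    targets = torch.randint(0, 10, (8,))
    groups = torch.randint(0, 4, (8,))
    weights = torch.full((4,), 0.25)
    loss = fused_loss(logits, targets, groups, weights)
    loss.backward()

In practice such a loss would be applied to the adapter parameters only, with the pretrained Whisper or SeamlessM4T backbone frozen; how the paper balances the three penalties is not specified in this summary.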
Authors (6)
Monorama Swain
Bubai Maji
Jagabandhu Mishra
Markus Schedl
Anders Søgaard
Jesper Rindom Jensen
Submitted
October 21, 2025
Key Contributions
This paper proposes a fairness-prompted finetuning method using lightweight adapters to build fairer English ASR systems for second-language speakers. By combining traditional ERM with fairness objectives (SD, Group-DRO, and IRM), the approach significantly reduces WER disparities across 26 accent groups while maintaining overall accuracy, outperforming standard ERM finetuning.
Business Value
Enables the development of more inclusive and equitable voice-enabled technologies, expanding market reach to global users and improving user experience for non-native speakers in applications like customer service, virtual assistants, and dictation software.