📄 Abstract
Whisper models have achieved remarkable progress in speech recognition, yet
their large size remains a bottleneck for deployment on resource-constrained
edge devices. This paper proposes a framework for designing fine-tuned variants of
Whisper that address this problem. Structured sparsity is enforced via
the Sparse Group LASSO penalty, used as a loss regularizer, to reduce the number of
floating-point operations (FLOPs). Further, a weight-statistics-aware pruning
algorithm is proposed. We also design a custom text normalizer for WER
evaluation. On the Common Voice 11.0 Hindi dataset, we obtain, without degrading
WER, (a) a 35.4% reduction in model parameters, 14.25% lower memory consumption,
and 18.5% fewer FLOPs on Whisper-small; (b) a 31% reduction in model
parameters, 15.29% lower memory consumption, and 16.95% fewer FLOPs on
Whisper-medium; and (c) a substantial improvement over the state-of-the-art
Iterative Magnitude Pruning based method, pruning 18.7% more parameters along
with a 12.31 reduction in WER.
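To make the regularization idea concrete, below is a minimal sketch of adding a Sparse Group LASSO penalty to a training loss in PyTorch. It assumes the rows of each linear weight matrix are the sparsity groups; the group definition, `alpha`, and `lambda_sgl` are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: Sparse Group LASSO (SGL) penalty as a loss regularizer.
# Assumption: groups = output rows of every nn.Linear weight matrix.
import torch
import torch.nn as nn


def sparse_group_lasso(model: nn.Module, alpha: float = 0.5) -> torch.Tensor:
    """SGL penalty: alpha * L1 term + (1 - alpha) * sum of group L2 norms."""
    device = next(model.parameters()).device
    l1_term = torch.tensor(0.0, device=device)
    group_term = torch.tensor(0.0, device=device)
    for module in model.modules():
        if isinstance(module, nn.Linear):
            w = module.weight
            l1_term = l1_term + w.abs().sum()
            # Treat each output row as a group; scale by sqrt(group size).
            group_term = group_term + (w.shape[1] ** 0.5) * w.norm(p=2, dim=1).sum()
    return alpha * l1_term + (1.0 - alpha) * group_term


# Usage inside a training step (task_loss is the usual ASR objective;
# lambda_sgl is a hypothetical hyperparameter):
# loss = task_loss + lambda_sgl * sparse_group_lasso(model, alpha=0.5)
# loss.backward()
```

Driving whole groups of weights to zero, rather than individual entries, is what allows structured (rather than unstructured) sparsity and hence real FLOP reductions.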
Authors (4)
Prasenjit K Mudi
Anshi Sachan
Dahlia Devapriya
Sheetal Kalyani
Submitted
October 14, 2025
Key Contributions
Proposes a framework for designing fine-tuned, efficient variants of Whisper models for edge devices. It employs structured sparsity via Sparse Group LASSO and a weight-statistics-aware pruning algorithm to significantly reduce model parameters, memory, and FLOPs without degrading WER, outperforming state-of-the-art pruning methods.
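As an illustration of what "weight-statistics-aware" pruning could look like in code, the sketch below zeroes out rows of each linear layer whose mean absolute weight falls below a multiple of that layer's weight standard deviation. The grouping, the statistic, and the threshold `k` are assumptions for illustration only; the paper's actual algorithm may differ.

```python
# Sketch: statistics-driven structured pruning (illustrative, not the
# paper's exact method). Rows with weak mean-|weight| relative to the
# layer's weight std are zeroed out.
import torch
import torch.nn as nn


@torch.no_grad()
def prune_by_weight_stats(model: nn.Module, k: float = 0.25) -> int:
    """Zero out weak rows in every nn.Linear; return the number of pruned rows."""
    pruned_rows = 0
    for module in model.modules():
        if isinstance(module, nn.Linear):
            w = module.weight
            threshold = k * w.std()
            row_score = w.abs().mean(dim=1)          # per-row statistic
            mask = (row_score >= threshold).float()  # keep "strong" rows
            w.mul_(mask.unsqueeze(1))                # zero out weak rows
            if module.bias is not None:
                module.bias.mul_(mask)
            pruned_rows += int((mask == 0).sum())
    return pruned_rows
```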
Business Value
Enables the deployment of advanced speech recognition capabilities on low-power devices, opening up new applications in voice assistants, real-time transcription, and accessibility tools for edge environments.