📄 Abstract
Recent foundation models such as SSAST, EAT, HuBERT, Qwen-Audio, and Audio
Flamingo achieve top-tier results across standard audio benchmarks but are
limited by fixed input sample rates and durations, which hinders their reusability. This
paper introduces the Augmentation-driven Multiview Audio Transformer (AMAuT), a
training-from-scratch framework that eliminates the dependency on pre-trained
weights while supporting arbitrary sample rates and audio lengths. AMAuT
integrates four key components: (1) augmentation-driven multiview learning for
robustness, (2) a conv1 + conv7 + conv1 one-dimensional CNN bottleneck for
stable temporal encoding, (3) dual CLS + TAL tokens for bidirectional context
representation, and (4) test-time adaptation/augmentation (TTA^2) to improve
inference reliability. Experiments on five public benchmarks (AudioMNIST,
SpeechCommands V1 & V2, VocalSound, and CochlScene) show that AMAuT achieves
accuracies of up to 99.8% while consuming less than 3% of the GPU hours required
by comparable pre-trained models. Thus, AMAuT presents a highly efficient and
flexible alternative to large pre-trained models, making state-of-the-art audio
classification accessible in computationally constrained settings.
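To make the bottleneck description concrete, below is a minimal PyTorch sketch of a conv1 + conv7 + conv1 one-dimensional block. Only the kernel-size pattern (1, 7, 1) comes from the abstract; the channel widths, normalization, activation, and padding choices are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Conv1d171Bottleneck(nn.Module):
    """Hypothetical sketch of a conv1 + conv7 + conv1 1D CNN bottleneck.

    Only the kernel-size pattern is taken from the abstract; channel widths,
    BatchNorm, and GELU are assumptions made for this illustration.
    """

    def __init__(self, in_channels: int, hidden_channels: int, out_channels: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(in_channels, hidden_channels, kernel_size=1),              # pointwise projection
            nn.BatchNorm1d(hidden_channels),
            nn.GELU(),
            nn.Conv1d(hidden_channels, hidden_channels, kernel_size=7, padding=3),  # local temporal context
            nn.BatchNorm1d(hidden_channels),
            nn.GELU(),
            nn.Conv1d(hidden_channels, out_channels, kernel_size=1),              # pointwise projection to model dim
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); length-preserving, so arbitrary audio lengths pass through
        return self.block(x)

# Example: map 128 feature channels to 768-dim tokens for a 400-step sequence
tokens = Conv1d171Bottleneck(128, 256, 768)(torch.randn(2, 128, 400))
print(tokens.shape)  # torch.Size([2, 768, 400])
```

Because every layer preserves the temporal length, such a block can feed a transformer regardless of the input clip's duration, which is consistent with the abstract's claim of supporting arbitrary audio lengths.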
Authors (4)
Weichuang Shao
Iman Yi Liao
Tomas Henrique Bode Maul
Tissa Chandesa
Submitted
October 22, 2025
Key Contributions
Introduces AMAuT, a framework for training audio transformers from scratch that supports arbitrary sample rates and audio lengths, overcoming the limitations of fixed-input models. It achieves high accuracy at a significantly reduced computational cost through components such as augmentation-driven multiview learning and TTA^2, as sketched below.
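TTA^2 combines test-time adaptation with test-time augmentation. As a rough illustration of the augmentation half only, the sketch below averages class probabilities over several augmented views of a single clip; `model` and the augmentation callables are hypothetical placeholders, and the adaptation step described in the paper is omitted.

```python
import torch

@torch.no_grad()
def tta_predict(model, waveform, augmentations):
    """Average class probabilities over augmented views of one clip.

    Minimal sketch of the test-time-augmentation half of TTA^2; `model` and
    the entries of `augmentations` are hypothetical callables, and the
    paper's test-time adaptation step is not shown here.
    """
    model.eval()
    probs = [model(aug(waveform)).softmax(dim=-1) for aug in augmentations]
    return torch.stack(probs).mean(dim=0)  # consensus prediction across views
```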
Business Value
Enables more flexible and cost-effective deployment of advanced audio processing models in applications requiring variable audio inputs, such as real-time speech recognition or environmental sound analysis.