
AMAuT: A Flexible and Efficient Multiview Audio Transformer Framework Trained from Scratch

Abstract

Recent foundation models such as SSAST, EAT, HuBERT, Qwen-Audio, and Audio Flamingo achieve top-tier results across standard audio benchmarks but are constrained by fixed input sample rates and durations, limiting their reusability. This paper introduces the Augmentation-driven Multiview Audio Transformer (AMAuT), a train-from-scratch framework that eliminates dependency on pre-trained weights while supporting arbitrary sample rates and audio lengths. AMAuT integrates four key components: (1) augmentation-driven multiview learning for robustness; (2) a conv1 + conv7 + conv1 one-dimensional CNN bottleneck for stable temporal encoding; (3) dual CLS + TAL tokens for bidirectional context representation; and (4) test-time adaptation/augmentation (TTA^2) to improve inference reliability. Experiments on five public benchmarks (AudioMNIST, SpeechCommands V1 & V2, VocalSound, and CochlScene) show that AMAuT reaches accuracies of up to 99.8% while consuming less than 3% of the GPU hours required by comparable pre-trained models. AMAuT thus offers a highly efficient and flexible alternative to large pre-trained models, making state-of-the-art audio classification accessible in computationally constrained settings.
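The conv1 + conv7 + conv1 bottleneck named in the abstract can be illustrated with a minimal single-channel sketch. This is an assumption-laden simplification: the weights below are hypothetical, and the actual block presumably operates over multiple channels with normalization and nonlinearities, none of which are specified here.

```python
def conv1d(x, kernel):
    # Valid (no-padding) 1-D convolution of a sequence with a kernel.
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]

def bottleneck(x, w_in, w_mid, w_out):
    # conv1 -> conv7 -> conv1 temporal bottleneck (single-channel sketch;
    # hypothetical weights, no normalization or activation).
    h = conv1d(x, w_in)      # kernel size 1: pointwise scaling
    h = conv1d(h, w_mid)     # kernel size 7: local temporal context
    return conv1d(h, w_out)  # kernel size 1: pointwise projection
```

The kernel-size-1 convolutions only mix channels (here, rescale), while the kernel-size-7 stage is the sole source of temporal context, which is the usual rationale for this bottleneck shape.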
Authors (4)
Weichuang Shao
Iman Yi Liao
Tomas Henrique Bode Maul
Tissa Chandesa
Submitted
October 22, 2025
arXiv Category
cs.SD

Key Contributions

Introduces AMAuT, a framework for training audio transformers from scratch that supports arbitrary sample rates and audio lengths, overcoming the limitations of fixed-input models. It achieves high accuracy at a fraction of the computational cost of pre-trained baselines through components such as augmentation-driven multiview learning and TTA^2.
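The TTA^2 idea of averaging predictions over augmented views at inference time can be sketched in a few lines. Everything below is a stand-in: the augmentation (random gain) and the classifier are hypothetical placeholders, not the paper's actual pipeline or model.

```python
import random

def augment(waveform, seed):
    # Hypothetical augmentation: random gain scaling, standing in for
    # whatever augmentations AMAuT actually applies at test time.
    rng = random.Random(seed)
    gain = rng.uniform(0.8, 1.2)
    return [s * gain for s in waveform]

def predict(waveform):
    # Placeholder two-class scorer; in AMAuT this would be the trained
    # multiview transformer's class probabilities.
    energy = sum(abs(s) for s in waveform) / len(waveform)
    return [energy, 1.0 - energy]

def tta_squared_predict(waveform, n_views=4):
    # Test-time augmentation: average class scores over augmented views.
    views = [augment(waveform, seed) for seed in range(n_views)]
    scores = [predict(v) for v in views]
    n_classes = len(scores[0])
    return [sum(s[c] for s in scores) / n_views for c in range(n_classes)]
```

Averaging over views smooths out prediction noise introduced by any single augmentation, which is the usual motivation for test-time augmentation; the adaptation half of TTA^2 (updating the model at inference) is not shown here.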

Business Value

Enables more flexible and cost-effective deployment of advanced audio processing models in applications requiring variable audio inputs, such as real-time speech recognition or environmental sound analysis.