📄 Abstract
Recent foundation models such as SSAST, EAT, HuBERT, Qwen-Audio, and Audio
Flamingo achieve top-tier results across standard audio benchmarks but are
limited by fixed input sample rates and durations, which hinders their reusability. This
paper introduces the Augmentation-driven Multiview Audio Transformer (AMAuT), a
training-from-scratch framework that eliminates the dependency on pre-trained
weights while supporting arbitrary sample rates and audio lengths. AMAuT
integrates four key components: (1) augmentation-driven multiview learning for
robustness, (2) a conv1 + conv7 + conv1 one-dimensional CNN bottleneck for
stable temporal encoding, (3) dual CLS + TAL tokens for bidirectional context
representation, and (4) test-time adaptation/augmentation (TTA^2) to improve
inference reliability. Experiments on five public benchmarks (AudioMNIST,
SpeechCommands V1 & V2, VocalSound, and CochlScene) show that AMAuT achieves
accuracies of up to 99.8% while consuming less than 3% of the GPU hours required
by comparable pre-trained models. Thus, AMAuT presents a highly efficient and
flexible alternative to large pre-trained models, making state-of-the-art audio
classification accessible in computationally constrained settings.
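To make the bottleneck description concrete, below is a minimal PyTorch sketch of a conv1 + conv7 + conv1 one-dimensional block. Only the kernel-size pattern (1, 7, 1) comes from the abstract; the channel widths, normalization, activation, and padding choices are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Conv1d171Bottleneck(nn.Module):
    """Hypothetical sketch of a conv1 + conv7 + conv1 1D CNN bottleneck.

    Only the kernel-size pattern is taken from the abstract; channel widths,
    BatchNorm, and GELU are assumptions made for this illustration.
    """

    def __init__(self, in_channels: int, hidden_channels: int, out_channels: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(in_channels, hidden_channels, kernel_size=1),              # pointwise projection
            nn.BatchNorm1d(hidden_channels),
            nn.GELU(),
            nn.Conv1d(hidden_channels, hidden_channels, kernel_size=7, padding=3),  # local temporal context
            nn.BatchNorm1d(hidden_channels),
            nn.GELU(),
            nn.Conv1d(hidden_channels, out_channels, kernel_size=1),              # pointwise projection to model dim
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); length-preserving, so arbitrary audio lengths pass through
        return self.block(x)

# Example: map 128 feature channels to 768-dim tokens for a 400-step sequence
tokens = Conv1d171Bottleneck(128, 256, 768)(torch.randn(2, 128, 400))
print(tokens.shape)  # torch.Size([2, 768, 400])
```

Because every layer preserves the temporal length, such a block can feed a transformer regardless of the input clip's duration, which is consistent with the abstract's claim of supporting arbitrary audio lengths.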
Authors (4)
Weichuang Shao
Iman Yi Liao
Tomas Henrique Bode Maul
Tissa Chandesa
Submitted
October 22, 2025
Key Contributions
Introduces AMAuT, a framework for training audio transformers from scratch that supports arbitrary sample rates and audio lengths, overcoming the limitations of fixed-input models. It achieves high accuracy at a significantly reduced computational cost through components such as augmentation-driven multiview learning and TTA^2, as sketched below.
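TTA^2 combines test-time adaptation with test-time augmentation. As a rough illustration of the augmentation half only, the sketch below averages class probabilities over several augmented views of a single clip; `model` and the augmentation callables are hypothetical placeholders, and the adaptation step described in the paper is omitted.

```python
import torch

@torch.no_grad()
def tta_predict(model, waveform, augmentations):
    """Average class probabilities over augmented views of one clip.

    Minimal sketch of the test-time-augmentation half of TTA^2; `model` and
    the entries of `augmentations` are hypothetical callables, and the
    paper's test-time adaptation step is not shown here.
    """
    model.eval()
    probs = [model(aug(waveform)).softmax(dim=-1) for aug in augmentations]
    return torch.stack(probs).mean(dim=0)  # consensus prediction across views
```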
Business Value
Enables more flexible and cost-effective deployment of advanced audio processing models in applications requiring variable audio inputs, such as real-time speech recognition or environmental sound analysis.