
UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE


Abstract

Recent advances in unified multimodal models indicate a clear trend towards comprehensive content generation. However, the auditory domain remains a significant challenge, with music and speech often developed in isolation, hindering progress towards universal audio synthesis. This separation stems from inherent task conflicts and severe data imbalances, which impede the development of a truly unified audio generation model. To address this challenge, we propose UniMoE-Audio, a unified speech and music generation model within a novel Dynamic-Capacity Mixture-of-Experts (MoE) framework. Architecturally, UniMoE-Audio introduces a Top-P routing strategy for dynamic expert number allocation, and a hybrid expert design comprising routed experts for domain-specific knowledge, shared experts for domain-agnostic features, and null experts for adaptive computation skipping. To tackle data imbalance, we introduce a three-stage training curriculum: 1) Independent Specialist Training leverages original datasets to instill domain-specific knowledge into each "proto-expert" without interference; 2) MoE Integration and Warmup incorporates these specialists into the UniMoE-Audio architecture, warming up the gate module and shared expert using a subset of the balanced dataset; and 3) Synergistic Joint Training trains the entire model end-to-end on the fully balanced dataset, fostering enhanced cross-domain synergy. Extensive experiments show that UniMoE-Audio not only achieves state-of-the-art performance on major speech and music generation benchmarks, but also demonstrates superior synergistic learning, mitigating the performance degradation typically seen in naive joint training. Our findings highlight the substantial potential of specialized MoE architectures and curated training strategies in advancing the field of universal audio generation. Homepage: https://mukioxun.github.io/Uni-MoE-site/home.html
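The abstract names Top-P routing and null experts without giving formulas, so the following is only a minimal sketch of how such a scheme could work, not the authors' implementation. It assumes the router picks the smallest set of experts whose cumulative softmax probability reaches a threshold `p`, that weights are renormalised over the chosen set, that a null expert is represented by `None` and contributes nothing, and that the shared expert's output is simply added. The names `top_p_route`, `moe_forward`, `EXPERTS`, and `SHARED` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_route(logits, p=0.7):
    """Pick the smallest set of experts whose cumulative probability reaches p.

    Returns {expert_index: weight}. Renormalising the weights over the
    chosen set is an assumption of this sketch.
    """
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return {i: probs[i] / cumulative for i in chosen}


def moe_forward(x, router_logits, routed_experts, shared_expert, p=0.7):
    """Combine one shared expert with the Top-P selection of routed experts.

    A routed expert given as None stands in for a null expert: it can be
    selected, but it runs no computation (assumed behaviour).
    """
    weights = top_p_route(router_logits, p)
    out = shared_expert(x)  # the shared expert runs for every token
    expert_calls = 0
    for i, w in weights.items():
        expert = routed_experts[i]
        if expert is None:  # null expert: skip the computation
            continue
        out += w * expert(x)
        expert_calls += 1
    return out, expert_calls


# Toy scalar "experts" for illustration only: two real experts, two null experts.
EXPERTS = [lambda x: 2.0 * x, lambda x: 3.0 * x, None, None]
SHARED = lambda x: x

# A confident router needs one expert; an uncertain router needs three.
print(len(top_p_route([4.0, 0.0, 0.0, 0.0], p=0.7)))  # 1
print(len(top_p_route([1.0, 1.0, 1.0, 1.0], p=0.7)))  # 3
```

In this sketch the number of active experts varies per token with router confidence, and a token routed mainly to a null expert costs only the shared-expert pass, which is one plausible reading of "dynamic capacity" and "adaptive computation skipping".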

Authors (16)

Zhenyu Liu

Yunxin Li

Xuanyu Zhang

Qixun Teng

Shenyuan Jiang

Xinyu Chen

+10 more

Submitted

October 15, 2025

arXiv Category

cs.SD


Key Contributions

UniMoE-Audio presents a novel unified model for speech and music generation using a Dynamic-Capacity Mixture-of-Experts (MoE) framework. It addresses challenges of task conflicts and data imbalance with a hybrid expert design (routed, shared, null experts) and a Top-P routing strategy for dynamic expert allocation. A three-stage training curriculum is also introduced to manage data imbalance, paving the way for universal audio synthesis.
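The three-stage curriculum can be pictured as a schedule of which parameter groups are updated and on which data. The sketch below encodes only what the abstract states. The group names (`specialists`, `gate`, `shared_expert`, `routed_experts`, `backbone`) are hypothetical, and freezing the routed experts during the warmup stage is an assumption: the abstract says only that the gate module and shared expert are warmed up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    trainable: tuple  # hypothetical parameter-group names


CURRICULUM = (
    # Stage 1: each domain specialist ("proto-expert") is trained on its own data.
    Stage("Independent Specialist Training",
          "original per-domain datasets",
          ("specialists",)),
    # Stage 2: specialists become routed experts; only the gate and the shared
    # expert are updated (freezing everything else is an assumption here).
    Stage("MoE Integration and Warmup",
          "subset of the balanced dataset",
          ("gate", "shared_expert")),
    # Stage 3: the whole model is trained end-to-end.
    Stage("Synergistic Joint Training",
          "fully balanced dataset",
          ("gate", "shared_expert", "routed_experts", "backbone")),
)


def trainable_groups(stage_number):
    """Return the set of parameter groups updated in a 1-indexed stage."""
    return set(CURRICULUM[stage_number - 1].trainable)


for number, stage in enumerate(CURRICULUM, start=1):
    print(number, stage.name, "|", stage.data, "|", sorted(stage.trainable))
```

In a real training loop, a mapping like this would typically drive which parameters have gradients enabled at each stage.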

Business Value

Enables more versatile and efficient audio generation tools that can produce both realistic speech and diverse music. This could benefit content creation for entertainment, virtual assistants, and personalized media.

Paper Metadata

Innovation Type

Algorithmic/Architectural

Deployment Feasibility

Moderate. MoE models can be computationally intensive, but dynamic capacity offers potential efficiency gains. Training requires a large amount of data.

Limitations Addressed

Isolation of speech and music generation research, Task conflicts in unified audio models, Severe data imbalances between speech and music

Technical Tags

audio generation, speech synthesis, music generation, unified model, Mixture-of-Experts (MoE), Dynamic Capacity, Top-P routing, data imbalance, multi-stage training

Research Topics

Generative AI, Audio Synthesis, Speech Technology, Music Technology, Model Architectures

Methods & Architectures

UniMoE-Audio framework, Dynamic-Capacity Mixture-of-Experts (MoE), Top-P routing strategy, Hybrid expert design, Three-stage training curriculum, Mixture-of-Experts (MoE)

Applications & Tasks

Audio Synthesis, Content Creation, Media Production, Unifying speech and music generation, Handling task conflicts and data imbalances, Developing universal audio synthesis models, Speech Generation, Music Generation, Unified Audio Synthesis

Related Fields

Generative AI, Machine Learning, Audio Signal Processing, Music Information Retrieval, Natural Language Processing

Keywords

audio generation, speech synthesis, music generation, unified model, MoE, Mixture-of-Experts, dynamic capacity, generative AI, AI, deep learning

Academic Context

#Generative AI, #Audio Synthesis, #Speech Technology, #Music Technology, #Model Architectures

Commercial Potential

Potential Products

AI-powered music composition tools, Advanced text-to-speech systems, Generative audio platforms

Target Industries

Music, Gaming, Media, Technology, Advertising

Use Case Examples

Generating background music for videos, Creating realistic voiceovers for virtual characters, Developing personalized audio experiences

Competitive Edge

Offers a unified approach to speech and music generation built on a dynamic-capacity MoE architecture. The abstract reports that it mitigates the performance degradation typically seen in naive joint training, which it attributes to task conflicts and data imbalance.

Market Opportunity

Large and growing, driven by AI content generation trends.

Revenue Models

SaaS for audio generation platforms, licensing of models, API access.

Resource Requirements

Compute Needs

Very High, for training large MoE models on diverse audio data.

Data Requirements

Large, diverse datasets of both speech and music.

Deployment Constraints

Computational cost for inference, potential latency issues with complex MoE routing.

Scalability

MoE architectures are designed for scalability by adding more experts, but dynamic capacity adds complexity.

Production Readiness

Maturity Level

Research

Time to Market

3-4 years for robust, production-ready systems.

Patent Potential

Moderate, for the dynamic MoE architecture and training strategies.
