📄 Abstract
Recent advances in unified multimodal models indicate a clear trend towards
comprehensive content generation. However, the auditory domain remains a
significant challenge, with music and speech often developed in isolation,
hindering progress towards universal audio synthesis. This separation stems
from inherent task conflicts and severe data imbalances, which impede the
development of a truly unified audio generation model. To address this
challenge, we propose UniMoE-Audio, a unified speech and music generation model
within a novel Dynamic-Capacity Mixture-of-Experts (MoE) framework.
Architecturally, UniMoE-Audio introduces a Top-P routing strategy for dynamic
expert number allocation, and a hybrid expert design comprising routed experts
for domain-specific knowledge, shared experts for domain-agnostic features, and
null experts for adaptive computation skipping. To tackle data imbalance, we
introduce a three-stage training curriculum: 1) Independent Specialist Training
leverages original datasets to instill domain-specific knowledge into each
"proto-expert" without interference; 2) MoE Integration and Warmup incorporates
these specialists into the UniMoE-Audio architecture, warming up the gate
module and shared expert using a subset of the balanced dataset; and 3) Synergistic
Joint Training trains the entire model end-to-end on the fully balanced
dataset, fostering enhanced cross-domain synergy. Extensive experiments show
that UniMoE-Audio not only achieves state-of-the-art performance on major
speech and music generation benchmarks, but also demonstrates superior
synergistic learning, mitigating the performance degradation typically seen in
naive joint training. Our findings highlight the substantial potential of
specialized MoE architectures and curated training strategies in advancing the
field of universal audio generation.
Homepage: https://mukioxun.github.io/Uni-MoE-site/home.html
Authors (16)
Zhenyu Liu
Yunxin Li
Xuanyu Zhang
Qixun Teng
Shenyuan Jiang
Xinyu Chen
+10 more
Submitted
October 15, 2025
Key Contributions
UniMoE-Audio is a unified speech and music generation model built on a Dynamic-Capacity Mixture-of-Experts (MoE) framework. It addresses task conflicts between the two domains with a Top-P routing strategy that allocates a dynamic number of experts, combined with a hybrid expert design: routed experts for domain-specific knowledge, shared experts for domain-agnostic features, and null experts for adaptive computation skipping. To counter data imbalance, it uses a three-stage training curriculum: independent specialist training, MoE integration and warmup, and synergistic joint training on a balanced dataset.
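The routing idea can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how Top-P selection is computed, so the code below assumes a nucleus-style rule (keep the highest-probability experts until their cumulative gate probability reaches a threshold `p`), assumes raw softmax probabilities are used as mixing weights, and uses scalar inputs for brevity. The names `top_p_route`, `hybrid_moe_forward`, and `num_null` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_route(logits, p=0.7):
    """Pick the smallest set of experts whose cumulative gate
    probability reaches p (assumed selection rule, not from the paper).
    Returns a list of (expert_index, probability) pairs."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append((i, probs[i]))
        cumulative += probs[i]
        if cumulative >= p:
            break
    return chosen

def hybrid_moe_forward(x, gate_logits, routed_experts, num_null, shared_expert, p=0.7):
    """Combine a shared expert, Top-P-selected routed experts, and null experts.
    Assumption: gate_logits covers the routed experts first, then num_null
    null experts; a selected null expert contributes nothing and costs no compute."""
    assert len(gate_logits) == len(routed_experts) + num_null
    out = shared_expert(x)          # shared expert always runs
    active = 0                      # number of routed experts actually executed
    for idx, weight in top_p_route(gate_logits, p):
        if idx >= len(routed_experts):
            continue                # null expert: skip computation
        out += weight * routed_experts[idx](x)
        active += 1
    return out, active

if __name__ == "__main__":
    experts = [lambda v: 2.0 * v, lambda v: 3.0 * v]
    shared = lambda v: v
    # Gate strongly prefers the null expert (last index): only the shared expert runs.
    print(hybrid_moe_forward(1.0, [0.0, 0.0, 5.0], experts, 1, shared))
    # Gate strongly prefers routed expert 0: one routed expert runs.
    print(hybrid_moe_forward(1.0, [5.0, 0.0, 0.0], experts, 1, shared))
```

Under this assumed rule, a sharply peaked gate distribution selects a single expert, while a uniform distribution over four experts selects three at `p=0.7`, which is one way the number of active experts could vary with routing confidence.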
Business Value
A single model that produces both realistic speech and diverse music enables more versatile and efficient audio generation tools. Potential applications include content creation for entertainment, virtual assistants, and personalized media.