📄 Abstract
Generating natural-sounding, multi-speaker dialogue is crucial for
applications such as podcast creation, virtual agents, and multimedia content
generation. However, existing systems struggle to maintain speaker consistency,
model overlapping speech, and synthesize coherent conversations efficiently. In
this paper, we introduce CoVoMix2, a fully non-autoregressive framework for
zero-shot multi-talker dialogue generation. CoVoMix2 directly predicts
mel-spectrograms from multi-stream transcriptions using a flow-matching-based
generative model, eliminating the reliance on intermediate token
representations. To better capture realistic conversational dynamics, we
propose transcription-level speaker disentanglement, sentence-level alignment,
and prompt-level random masking strategies. Our approach achieves
state-of-the-art performance, outperforming strong baselines like MoonCast and
Sesame in speech quality, speaker consistency, and inference speed. Notably,
CoVoMix2 operates without requiring transcriptions for the prompt and supports
controllable dialogue generation, including overlapping speech and precise
timing control, demonstrating strong generalizability to real-world speech
generation scenarios.
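The flow-matching objective underlying this kind of generative model can be sketched as follows. This is a minimal illustration of conditional flow matching with a linear (optimal-transport) path, not CoVoMix2's actual training code; the function name, shapes, and the 80-bin mel-spectrogram target are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_training_example(x1, rng):
    """Build one conditional-flow-matching training example.

    x1: a data sample (here, a mel-spectrogram of shape n_frames x n_mels).
    Returns (t, x_t, v_target): a random time in [0, 1], the point on the
    linear noise-to-data path at that time, and the constant target
    velocity a velocity network would be regressed onto with an MSE loss.
    """
    x0 = rng.standard_normal(x1.shape)   # noise sample at t = 0
    t = rng.uniform()                    # random time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1        # linear interpolation path
    v_target = x1 - x0                   # straight-line target velocity
    return t, x_t, v_target

# Illustrative target: 4 mel-spectrogram frames with 80 mel bins each.
x1 = rng.standard_normal((4, 80))
t, x_t, v_target = cfm_training_example(x1, rng)

# Following the straight-line velocity from x_t for the remaining
# time (1 - t) lands exactly on the data sample x1.
assert np.allclose(x_t + (1.0 - t) * v_target, x1)
```

At inference, a trained velocity network conditioned on the multi-stream transcription would replace `v_target`, and integrating dx/dt = v(x, t) from noise at t = 0 to t = 1 produces the mel-spectrogram in a fixed number of steps, which is what makes the non-autoregressive pipeline fast.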
Authors (11)
Leying Zhang
Yao Qian
Xiaofei Wang
Manthan Thakker
Dongmei Wang
Jianwei Yu
Key Contributions
Introduces CoVoMix2, a fully non-autoregressive framework for zero-shot multi-talker dialogue generation using flow matching. It directly predicts mel-spectrograms from transcriptions, eliminating intermediate tokens. Novel strategies like transcription-level speaker disentanglement, sentence-level alignment, and prompt-level random masking improve realism, achieving state-of-the-art performance in speech quality, speaker consistency, and inference speed.
Business Value
Enables the creation of more realistic and engaging synthetic dialogue for various media and interactive applications, reducing the cost and time associated with human voice actors and content production.