📄 Abstract
Generating natural-sounding, multi-speaker dialogue is crucial for
applications such as podcast creation, virtual agents, and multimedia content
generation. However, existing systems struggle to maintain speaker consistency,
model overlapping speech, and synthesize coherent conversations efficiently. In
this paper, we introduce CoVoMix2, a fully non-autoregressive framework for
zero-shot multi-talker dialogue generation. CoVoMix2 directly predicts
mel-spectrograms from multi-stream transcriptions using a flow-matching-based
generative model, eliminating the reliance on intermediate token
representations. To better capture realistic conversational dynamics, we
propose transcription-level speaker disentanglement, sentence-level alignment,
and prompt-level random masking strategies. Our approach achieves
state-of-the-art performance, outperforming strong baselines like MoonCast and
Sesame in speech quality, speaker consistency, and inference speed. Notably,
CoVoMix2 operates without requiring transcriptions for the prompt and supports
controllable dialogue generation, including overlapping speech and precise
timing control, demonstrating strong generalizability to real-world speech
generation scenarios.
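The flow-matching objective underlying this kind of generative model can be sketched as follows. This is a minimal illustration of conditional flow matching with a linear (optimal-transport) path, not CoVoMix2's actual training code; the function name, shapes, and the 80-bin mel-spectrogram target are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_training_example(x1, rng):
    """Build one conditional-flow-matching training example.

    x1: a data sample (here, a mel-spectrogram of shape n_frames x n_mels).
    Returns (t, x_t, v_target): a random time in [0, 1], the point on the
    linear noise-to-data path at that time, and the constant target
    velocity a velocity network would be regressed onto with an MSE loss.
    """
    x0 = rng.standard_normal(x1.shape)   # noise sample at t = 0
    t = rng.uniform()                    # random time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1        # linear interpolation path
    v_target = x1 - x0                   # straight-line target velocity
    return t, x_t, v_target

# Illustrative target: 4 mel-spectrogram frames with 80 mel bins each.
x1 = rng.standard_normal((4, 80))
t, x_t, v_target = cfm_training_example(x1, rng)

# Following the straight-line velocity from x_t for the remaining
# time (1 - t) lands exactly on the data sample x1.
assert np.allclose(x_t + (1.0 - t) * v_target, x1)
```

At inference, a trained velocity network conditioned on the multi-stream transcription would replace `v_target`, and integrating dx/dt = v(x, t) from noise at t = 0 to t = 1 produces the mel-spectrogram in a fixed number of steps, which is what makes the non-autoregressive pipeline fast.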
Authors (11)
Leying Zhang
Yao Qian
Xiaofei Wang
Manthan Thakker
Dongmei Wang
Jianwei Yu
Key Contributions
Introduces CoVoMix2, a fully non-autoregressive framework for zero-shot multi-talker dialogue generation using flow matching. It directly predicts mel-spectrograms from transcriptions, eliminating intermediate tokens. Novel strategies like transcription-level speaker disentanglement, sentence-level alignment, and prompt-level random masking improve realism, achieving state-of-the-art performance in speech quality, speaker consistency, and inference speed.
Business Value
Enables the creation of more realistic and engaging synthetic dialogue for various media and interactive applications, reducing the cost and time associated with human voice actors and content production.