arxiv_ai 95% Match Research Paper Speech researchers, Audio engineers, Machine learning practitioners, Developers of voice-based applications 4 weeks ago

Speak, Edit, Repeat: High-Fidelity Voice Editing and Zero-Shot TTS with Cross-Attentive Mamba


📄 Abstract

We introduce MAVE (Mamba with Cross-Attention for Voice Editing and Synthesis), a novel autoregressive architecture for text-conditioned voice editing and high-fidelity text-to-speech (TTS) synthesis, built on a cross-attentive Mamba backbone. MAVE achieves state-of-the-art performance in speech editing and very competitive results in zero-shot TTS, while not being explicitly trained on the latter task, outperforming leading autoregressive and diffusion models on diverse, real-world audio. By integrating Mamba for efficient audio sequence modeling with cross-attention for precise text-acoustic alignment, MAVE enables context-aware voice editing with exceptional naturalness and speaker consistency. In pairwise human evaluations on a random 40-sample subset of the RealEdit benchmark (400 judgments), 57.2% of listeners rated MAVE-edited speech as perceptually equal to the original, while 24.8% preferred the original and 18.0% preferred MAVE, demonstrating that in the majority of cases edits are indistinguishable from the source. MAVE compares favorably with VoiceCraft and FluentSpeech both on pairwise comparisons and standalone mean opinion score (MOS) evaluations. For zero-shot TTS, MAVE exceeds VoiceCraft in both speaker similarity and naturalness, without requiring multiple inference runs or post-processing. Remarkably, these quality gains come with a significantly lower memory cost and approximately the same latency: MAVE requires ~6x less memory than VoiceCraft during inference on utterances from the RealEdit database (mean duration: 6.21s, A100, FP16, batch size 1). Our results demonstrate that MAVE establishes a new standard for flexible, high-fidelity voice editing and synthesis through the synergistic integration of structured state-space modeling and cross-modal attention.

Key Contributions

Introduces MAVE, a novel autoregressive architecture based on a cross-attentive Mamba backbone for high-fidelity voice editing and zero-shot TTS. MAVE achieves state-of-the-art performance in speech editing and competitive results in zero-shot TTS by efficiently modeling audio sequences and precisely aligning text with acoustics, enabling context-aware editing with exceptional naturalness and speaker consistency.
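To make the "text-acoustic alignment" idea concrete, here is a minimal sketch of single-head cross-attention in plain Python, in which each audio-side vector (query) takes a weighted mixture of text-side vectors (keys and values). It is an illustration only, not the authors' implementation: the function names `softmax` and `cross_attention` are hypothetical, learned projection matrices are omitted, and the Mamba layers that would produce the audio states are not modeled.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(audio_states, text_states):
    """Toy single-head cross-attention without learned projections.

    audio_states: list of query vectors, one per audio token (assumed input).
    text_states:  list of key/value vectors, one per text token (assumed input).
    Returns (outputs, weights): one mixed text vector and one weight row per audio token.
    """
    dim = len(text_states[0])
    scale = 1.0 / math.sqrt(dim)  # standard scaled dot-product factor
    outputs, weights = [], []
    for query in audio_states:
        # Similarity of this audio token to every text token.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in text_states]
        row = softmax(scores)
        # Weighted average of the text vectors.
        mixed = [sum(w * value[d] for w, value in zip(row, text_states)) for d in range(dim)]
        outputs.append(mixed)
        weights.append(row)
    return outputs, weights


# Two audio tokens, two text tokens; each audio token lines up with one text token.
audio = [[1.0, 0.0], [0.0, 1.0]]
text = [[4.0, 0.0], [0.0, 4.0]]
outputs, weights = cross_attention(audio, text)
print(weights[0])  # first audio token puts about 0.94 of its weight on the first text token
```

Each weight row sums to 1, so the rows can be read as a soft alignment between audio positions and text positions.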

Business Value

Enables more natural and controllable voice synthesis for applications like personalized content creation, virtual assistants, and audio dubbing, potentially reducing production costs and time.

Paper Metadata

Innovation Type

Algorithmic

Deployment Feasibility

Potentially feasible with efficient Mamba architecture, but real-time performance and computational resources for training/inference need further evaluation.

Limitations Addressed

Lack of high-fidelity voice editing capabilities, Limited performance in zero-shot TTS, Difficulty in achieving speaker consistency and naturalness

Performance Gains

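State-of-the-art in speech editing and competitive in zero-shot TTS on diverse, real-world audio, exceeding VoiceCraft in speaker similarity and naturalness. Inference needs ~6x less memory than VoiceCraft at approximately the same latency (RealEdit utterances, A100, FP16, batch size 1).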

Technical Tags

Mamba, Cross-Attention, Autoregressive Models, Voice Editing, Text-to-Speech (TTS), Zero-Shot TTS, Audio Sequence Modeling, Speaker Consistency, Real-time Audio, Speech Synthesis

Research Topics

Speech Synthesis, Voice Manipulation, Deep Learning Architectures, Audio Signal Processing, Natural Language Processing

Methods & Architectures

Mamba, Cross-Attention, Autoregressive Modeling, Sequence Modeling, Text Conditioning, Cross-Attentive Mamba, Autoregressive Architecture

Applications & Tasks

Speech Technology, Audio Production, Virtual Assistants, Content Creation, High-fidelity voice editing, Zero-shot text-to-speech synthesis, Speaker consistency, Naturalness in synthesized speech, Voice editing, Text-to-speech synthesis, Speaker adaptation

Datasets & Benchmarks

Datasets

RealEdit benchmark

Benchmarks

Pairwise human evaluations on RealEdit benchmark (57.2% perceptually equal, 24.8% original preferred, 18.0% MAVE preferred)

Perceptual evaluation, Human judgment
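The pairwise percentages can be sanity-checked with a few lines of arithmetic. The sketch below uses only the figures stated in the abstract (400 judgments; 57.2%, 24.8%, 18.0%). The per-category counts it prints are approximate reconstructions by rounding, not numbers reported by the paper, and the names `approx_counts` and `not_worse_pct` are illustrative.

```python
TOTAL_JUDGMENTS = 400  # stated in the abstract

# Reported shares of judgments, in percent.
shares_pct = {
    "equal": 57.2,
    "original_preferred": 24.8,
    "mave_preferred": 18.0,
}


def approx_counts(shares, total):
    """Convert percentage shares to approximate integer counts (assumption: simple rounding)."""
    return {name: round(total * pct / 100) for name, pct in shares.items()}


counts = approx_counts(shares_pct, TOTAL_JUDGMENTS)

# Share of judgments in which the edit was rated equal to, or better than, the original.
not_worse_pct = round(shares_pct["equal"] + shares_pct["mave_preferred"], 1)

print(counts)
print(not_worse_pct)
```

Under these assumptions the three shares sum to 100%, and the edited speech is rated equal to or better than the original in roughly three quarters of the judgments.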

Related Fields

Speech Processing, Natural Language Processing, Machine Learning, Deep Learning, Audio Engineering

Keywords

Mamba, Cross-Attention, Voice Editing, Text-to-Speech, TTS, Zero-Shot Learning, Speech Synthesis, Audio Generation, Speaker Identity, Autoregressive Models, Sequence Modeling, Deep Learning, AI

Academic Context

#Speech Synthesis #Voice Manipulation #Deep Learning Architectures #Audio Signal Processing #Natural Language Processing

Commercial Potential

Potential Products

Advanced voice editing software, High-fidelity TTS engines, Personalized voice assistants, AI-powered audio production tools

Target Industries

Media and Entertainment, Gaming, Customer Service, Technology

Use Case Examples

Editing dialogue in films or podcasts, Generating custom voiceovers, Creating realistic virtual characters, Personalizing virtual assistant voices

Competitive Edge

Positions itself as a state-of-the-art solution for voice editing and competitive in zero-shot TTS, outperforming existing autoregressive and diffusion models.

Market Opportunity

Growing market for AI-driven audio and voice technologies.

Revenue Models

Licensing of technology, SaaS for audio production tools, API access.

Resource Requirements

Compute Needs

Likely significant for training; inference requirements depend on Mamba's efficiency and model size.

Data Requirements

Requires diverse audio datasets for training TTS and voice editing tasks.

Deployment Constraints

Real-time processing for interactive applications might be challenging.

Scalability

Mamba's efficiency in sequence modeling suggests good scalability for longer audio sequences.
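The scalability argument rests on a property of state-space recurrences: at inference time the model carries a fixed-size state forward, so per-step memory does not grow with sequence length. The toy sketch below shows that property with a diagonal linear recurrence, h_t = a * h_{t-1} + b * x_t. It is a deliberate simplification and not Mamba itself: real Mamba layers make the recurrence parameters depend on the input, and the function name `ssm_scan` and the decay and gain values are assumptions chosen for illustration.

```python
def ssm_scan(inputs, decays=(0.5, 0.8, 0.9, 0.99), gain=0.1):
    """Run a toy diagonal linear recurrence over a 1-D input sequence.

    Each state channel i follows h_i[t] = decays[i] * h_i[t-1] + gain * x[t].
    Returns (outputs, final_state); the state always has len(decays) entries.
    """
    state = [0.0] * len(decays)
    outputs = []
    for x in inputs:
        # Update every channel in place of the old state; nothing is appended to it.
        state = [a * h + gain * x for a, h in zip(decays, state)]
        # Read out the mean of the channels as this step's output.
        outputs.append(sum(state) / len(state))
    return outputs, state


short_out, short_state = ssm_scan([1.0] * 10)
long_out, long_state = ssm_scan([1.0] * 10_000)
print(len(short_state), len(long_state))  # same state size for both sequence lengths
```

By contrast, a standard Transformer decoder keeps a key/value cache that grows linearly with the number of generated tokens, which is one plausible source of the memory gap the abstract reports against VoiceCraft.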

Production Readiness

Maturity Level

Research

Time to Market

1-3 years for robust productization.

Patent Potential

High, due to novel architecture and performance claims.
