Abstract
We introduce MAVE (Mamba with Cross-Attention for Voice Editing and
Synthesis), a novel autoregressive architecture for text-conditioned voice
editing and high-fidelity text-to-speech (TTS) synthesis, built on a
cross-attentive Mamba backbone. MAVE achieves state-of-the-art performance in
speech editing and highly competitive results in zero-shot TTS, despite not being
explicitly trained on the latter task, outperforming leading autoregressive and
diffusion models on diverse, real-world audio. By integrating Mamba for
efficient audio sequence modeling with cross-attention for precise
text-acoustic alignment, MAVE enables context-aware voice editing with
exceptional naturalness and speaker consistency. In pairwise human evaluations
on a random 40-sample subset of the RealEdit benchmark (400 judgments), 57.2%
of listeners rated MAVE-edited speech as perceptually equal to the original,
while 24.8% preferred the original and 18.0% preferred MAVE, indicating that in
the majority of cases edits are indistinguishable from the source. MAVE compares
favorably with VoiceCraft and FluentSpeech both on pairwise comparisons and
standalone mean opinion score (MOS) evaluations. For zero-shot TTS, MAVE
exceeds VoiceCraft in both speaker similarity and naturalness, without
requiring multiple inference runs or post-processing. Remarkably, these quality
gains come with a significantly lower memory cost and approximately the same
latency: MAVE requires ~6x less memory than VoiceCraft during inference on
utterances from the RealEdit database (mean duration: 6.21s, A100, FP16, batch
size 1). Our results demonstrate that MAVE establishes a new standard for
flexible, high-fidelity voice editing and synthesis through the synergistic
integration of structured state-space modeling and cross-modal attention.
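To make the architectural idea concrete, the sketch below shows one possible cross-attentive Mamba decoder layer: a Mamba block models the acoustic-token sequence, and a cross-attention step aligns those hidden states with text encodings. This is an illustrative assumption, not MAVE's published implementation; it uses the open-source `mamba_ssm` package (state-spaces/mamba) as a stand-in for the paper's Mamba backbone, and the layer names, dimensions, and normalization placement are hypothetical.

```python
# Minimal sketch of a cross-attentive Mamba decoder layer (illustrative only).
# Assumes the open-source `mamba_ssm` package, whose kernels require a CUDA GPU.
import torch
import torch.nn as nn
from mamba_ssm import Mamba


class CrossAttentiveMambaLayer(nn.Module):
    """One decoder layer: Mamba over the audio-token sequence, followed by
    cross-attention from audio hidden states to text encodings."""

    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        self.mamba = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, audio_h: torch.Tensor, text_h: torch.Tensor) -> torch.Tensor:
        # audio_h: (batch, T_audio, d_model) acoustic-token hidden states
        # text_h:  (batch, T_text, d_model) text encodings used as keys/values
        audio_h = audio_h + self.mamba(self.norm1(audio_h))            # efficient sequence modeling
        attn_out, _ = self.cross_attn(self.norm2(audio_h), text_h, text_h)
        return audio_h + attn_out                                      # text-acoustic alignment


# Usage sketch: stack such layers and decode acoustic codec tokens autoregressively.
device = "cuda"
layer = CrossAttentiveMambaLayer().to(device)
audio_h = torch.randn(1, 200, 768, device=device)  # e.g., codec-token embeddings
text_h = torch.randn(1, 50, 768, device=device)    # e.g., phoneme/text embeddings
out = layer(audio_h, text_h)                        # -> (1, 200, 768)
```

In this arrangement, the Mamba block carries the long-range acoustic context at linear cost in sequence length, while the cross-attention step is the only place text conditioning enters, which matches the division of labor described in the abstract.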
Key Contributions
Introduces MAVE, a novel autoregressive architecture based on a cross-attentive Mamba backbone for high-fidelity voice editing and zero-shot TTS. MAVE achieves state-of-the-art performance in speech editing and competitive results in zero-shot TTS by efficiently modeling audio sequences and precisely aligning text with acoustics, enabling context-aware editing with exceptional naturalness and speaker consistency.
Business Value
Enables more natural and controllable voice synthesis for applications like personalized content creation, virtual assistants, and audio dubbing, potentially reducing production costs and time.