Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis

Abstract

We propose Shallow Flow Matching (SFM), a novel mechanism that enhances flow matching (FM)-based text-to-speech (TTS) models within a coarse-to-fine generation paradigm. Unlike conventional FM modules, which use the coarse representations from the weak generator as conditions, SFM constructs intermediate states along the FM paths from these representations. During training, we introduce an orthogonal projection method to adaptively determine the temporal position of these states, and apply a principled construction strategy based on a single-segment piecewise flow. The SFM inference starts from the intermediate state rather than pure noise, thereby focusing computation on the latter stages of the FM paths. We integrate SFM into multiple TTS models with a lightweight SFM head. Experiments demonstrate that SFM yields consistent gains in speech naturalness across both objective and subjective evaluations, and significantly accelerates inference when using adaptive-step ODE solvers. Demo and codes are available at https://ydqmkkx.github.io/SFMDemo/.
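To make the "start from an intermediate state" idea concrete, here is a minimal toy sketch in pure Python. It is not the authors' implementation: the velocity function is a hypothetical stand-in for a trained FM network, the straight noise-to-data path and the value of `t_mid` are assumptions, and a fixed-step Euler solver is used only because it makes the step count easy to see.

```python
def toy_velocity(x, t, target):
    # Hypothetical stand-in for a trained FM network: the velocity that moves
    # x along a straight line so that it reaches `target` at t = 1.
    return [(x1 - xi) / (1.0 - t) for xi, x1 in zip(x, target)]


def euler_solve(x_start, t_start, target, step=0.1):
    """Integrate dx/dt = v(x, t) from t_start to 1 with fixed Euler steps.

    Returns the final state and the number of velocity evaluations.
    """
    n_steps = max(1, round((1.0 - t_start) / step))
    dt = (1.0 - t_start) / n_steps
    x = list(x_start)
    for k in range(n_steps):
        t = t_start + k * dt
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x, n_steps


target = [0.5, -1.0, 2.0]   # toy "fine" features
noise = [1.3, 0.2, -0.7]    # fixed sample standing in for Gaussian noise
coarse = [0.4, -0.8, 1.7]   # toy weak-generator output near the target
t_mid = 0.6                 # assumed temporal position of the intermediate state

# Assumed construction: place the coarse output on a straight noise-to-data path.
x_mid = [(1.0 - t_mid) * n + t_mid * c for n, c in zip(noise, coarse)]

full, n_full = euler_solve(noise, 0.0, target)           # conventional start at t = 0
shallow, n_shallow = euler_solve(x_mid, t_mid, target)   # shallow start at t = t_mid
print(n_full, n_shallow)
```

With these settings the full path takes 10 velocity evaluations and the shallow path takes 4, since only the interval from `t_mid` to 1 is integrated. The abstract reports its speedups for adaptive-step ODE solvers, so this fixed-step count is only an illustration of where the savings come from.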
Authors: Dong Yang, Yiyi Cai, Yuki Saito, Lixu Wang, Hiroshi Saruwatari
Submitted: May 18, 2025
arXiv category: eess.AS

Key Contributions

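Introduces Shallow Flow Matching (SFM), a mechanism that enhances flow matching (FM)-based TTS models within a coarse-to-fine paradigm. Unlike conventional FM modules, which use the weak generator's coarse representations as conditions, SFM constructs intermediate states along the FM paths from those representations. During training, an orthogonal projection adaptively determines the temporal position of each state, and the state is built with a construction strategy based on a single-segment piecewise flow. Inference then starts from the intermediate state rather than pure noise, which improves speech naturalness in both objective and subjective evaluations and significantly accelerates inference with adaptive-step ODE solvers.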
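The summary does not give the projection formula, so the following is only one plausible reading, sketched under stated assumptions: the FM path is taken to be a straight line from noise to the target, and the temporal position is estimated as the coefficient of the coarse output's orthogonal projection onto the target. The function name `project_time` is hypothetical.

```python
def project_time(coarse, target):
    """Orthogonally project `coarse` onto `target`.

    Returns the projection coefficient clamped to [0, 1] (read here as a
    temporal position on an assumed straight path) and the residual of the
    unclamped projection, which is orthogonal to `target`.
    """
    dot = sum(c * x for c, x in zip(coarse, target))
    norm_sq = sum(x * x for x in target)
    if norm_sq == 0.0:
        raise ValueError("target must be non-zero")
    coeff = dot / norm_sq
    residual = [c - coeff * x for c, x in zip(coarse, target)]
    return min(1.0, max(0.0, coeff)), residual


target = [1.0, 2.0, 2.0]
# Toy coarse output: 0.7 * target plus a component orthogonal to the target.
coarse = [0.9, 1.3, 1.4]
t_hat, residual = project_time(coarse, target)
print(t_hat, residual)
```

In this toy case the coarse output is mostly aligned with the target, so the estimated position lands late on the path, and the residual plays the role of the part that the remaining flow still has to remove.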

Business Value

Enables faster and more natural-sounding synthetic speech generation, improving user experiences in voice assistants, audiobooks, virtual characters, and accessibility tools.