📄 Abstract
We propose Shallow Flow Matching (SFM), a novel mechanism that enhances flow
matching (FM)-based text-to-speech (TTS) models within a coarse-to-fine
generation paradigm. Unlike conventional FM modules, which use the coarse
representations from the weak generator as conditions, SFM constructs
intermediate states along the FM paths from these representations. During
training, we introduce an orthogonal projection method to adaptively determine
the temporal position of these states, and apply a principled construction
strategy based on a single-segment piecewise flow. The SFM inference starts
from the intermediate state rather than pure noise, thereby focusing
computation on the latter stages of the FM paths. We integrate SFM into
multiple TTS models with a lightweight SFM head. Experiments demonstrate that
SFM yields consistent gains in speech naturalness across both objective and
subjective evaluations, and significantly accelerates inference when using
adaptive-step ODE solvers. A demo and code are available at
https://ydqmkkx.github.io/SFMDemo/.
Authors (5)
Dong Yang
Yiyi Cai
Yuki Saito
Lixu Wang
Hiroshi Saruwatari
Key Contributions
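Introduces Shallow Flow Matching (SFM), a mechanism that enhances flow matching (FM)-based TTS models within a coarse-to-fine generation paradigm. Rather than using the weak generator's coarse representations as conditions, SFM constructs intermediate states along the FM paths from them: during training, an orthogonal projection method adaptively determines each state's temporal position, and a single-segment piecewise flow defines how the state is constructed. Inference then starts from the intermediate state rather than pure noise, which improves speech naturalness in both objective and subjective evaluations and significantly accelerates inference with adaptive-step ODE solvers.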
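The idea of starting inference partway along the FM path can be illustrated with a small sketch. This is not the authors' implementation: the function names (`project_time`, `euler_from`, `make_toy_velocity`) are hypothetical, the projection formula is only one plausible reading of "orthogonal projection", the path is assumed to be the linear one from noise at t = 0 to the target at t = 1, and a fixed-step Euler loop stands in for the adaptive-step ODE solvers mentioned in the abstract. The noise handling and the single-segment piecewise flow construction are omitted.

```python
from typing import Callable, List

Vector = List[float]


def project_time(x_g: Vector, x_1: Vector) -> float:
    """Assumed reading of the projection step: pick t so that t * x_1 is the
    orthogonal projection of the coarse output x_g onto the target x_1,
    clamped to [0, 1]. Only usable when the target is known (training)."""
    dot = sum(g * y for g, y in zip(x_g, x_1))
    norm_sq = sum(y * y for y in x_1)
    if norm_sq == 0.0:
        return 0.0
    return min(1.0, max(0.0, dot / norm_sq))


def euler_from(velocity: Callable[[Vector, float], Vector],
               x_start: Vector, t_start: float, steps: int) -> Vector:
    """Fixed-step Euler integration from t_start to 1 (a stand-in for an
    adaptive-step solver). Starting at t_start > 0 skips the early path."""
    if t_start >= 1.0:
        return list(x_start)
    dt = (1.0 - t_start) / steps
    x = list(x_start)
    for i in range(steps):
        t = t_start + i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(x_1: Vector) -> Callable[[Vector, float], Vector]:
    """Toy velocity field for the linear path toward a known target x_1.
    A real model would use a learned network here instead."""
    def velocity(x: Vector, t: float) -> Vector:
        return [(y - xi) / (1.0 - t) for xi, y in zip(x, x_1)]
    return velocity


# Toy data: x_1 plays the role of the target, x_g the coarse representation.
x_1 = [2.0, 0.0]
x_g = [1.0, 0.5]

# The projected time is reused at "inference" purely for this demo;
# in practice the target is unknown at inference time.
t_g = project_time(x_g, x_1)
x_hat = euler_from(make_toy_velocity(x_1), x_g, t_g, steps=4)
```

With the same step size, integrating from `t_g` to 1 needs proportionally fewer velocity evaluations than integrating from 0 to 1, which is the intuition behind focusing computation on the latter stages of the path.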
Business Value
Enables faster and more natural-sounding synthetic speech generation, improving user experiences in voice assistants, audiobooks, virtual characters, and accessibility tools.