📄 Abstract
Cross-speaker emotion transfer in speech synthesis relies on extracting
speaker-independent emotion embeddings for accurate emotion modeling without
retaining speaker traits. However, existing timbre compression methods fail to
fully separate speaker and emotion characteristics, causing speaker leakage and
degraded synthesis quality. To address this, we propose DiEmo-TTS, a
self-supervised distillation method to minimize emotional information loss and
preserve speaker identity. We introduce cluster-driven sampling and information
perturbation to preserve emotion while removing irrelevant factors. To
facilitate this process, we propose an emotion clustering and matching approach
using emotional attribute prediction and speaker embeddings, enabling
generalization to unlabeled data. Additionally, we design a dual conditioning
transformer to better integrate style features. Experimental results confirm
the effectiveness of our method in learning speaker-irrelevant emotion
embeddings.
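
To make the emotion clustering and matching idea more concrete, here is a minimal sketch. It is not the authors' implementation: it assumes per-utterance emotion embeddings, predicted emotional attributes (e.g., arousal/valence/dominance), KMeans clustering per speaker, and Hungarian matching of clusters across speakers, all of which are stand-ins for whatever the paper actually uses.

```python
# Hedged sketch: cluster each speaker's emotion embeddings, then match clusters
# across speakers by comparing mean predicted emotional attributes per cluster.
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment


def cluster_speaker_emotions(embeddings: np.ndarray, n_clusters: int = 4) -> np.ndarray:
    """Cluster one speaker's utterance-level emotion embeddings (N x D)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return km.fit_predict(embeddings)


def match_clusters(attrs_a, labels_a, attrs_b, labels_b, n_clusters: int = 4) -> dict:
    """Match speaker A's clusters to speaker B's by distance between the mean
    predicted emotional attributes of each cluster (Hungarian assignment)."""
    cent_a = np.stack([attrs_a[labels_a == k].mean(axis=0) for k in range(n_clusters)])
    cent_b = np.stack([attrs_b[labels_b == k].mean(axis=0) for k in range(n_clusters)])
    cost = np.linalg.norm(cent_a[:, None, :] - cent_b[None, :, :], axis=-1)
    row, col = linear_sum_assignment(cost)
    return dict(zip(row.tolist(), col.tolist()))  # cluster id in A -> matched id in B


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb_a, emb_b = rng.normal(size=(200, 64)), rng.normal(size=(180, 64))
    attr_a, attr_b = rng.uniform(size=(200, 3)), rng.uniform(size=(180, 3))
    la, lb = cluster_speaker_emotions(emb_a), cluster_speaker_emotions(emb_b)
    print(match_clusters(attr_a, la, attr_b, lb))
```

Matched clusters of this kind could then supply same-emotion, different-speaker pairs for the cluster-driven sampling and distillation described above.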
Authors (4)
Deok-Hyeon Cho
Hyung-Seok Oh
Seung-Bin Kim
Seong-Whan Lee
Key Contributions
Introduces DiEmo-TTS, a self-supervised distillation method for cross-speaker emotion transfer in TTS that effectively separates speaker and emotion characteristics. It uses cluster-driven sampling, information perturbation, and emotion clustering/matching to preserve emotion while removing speaker traits, enabling generalization to unlabeled data, and integrates style features through a dual conditioning transformer.
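
The exact architecture of the dual conditioning transformer is not detailed on this page; the sketch below shows one plausible reading, with separate cross-attention paths for the emotion and speaker embeddings. All layer choices and dimensions are assumptions for illustration only.

```python
# Hedged sketch of a "dual conditioning" transformer block: content states are
# conditioned on emotion and speaker style tokens via two cross-attention paths.
import torch
import torch.nn as nn


class DualConditioningBlock(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.emo_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.spk_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, x, emo, spk):
        # x: (B, T, D) content states; emo, spk: (B, 1, D) style tokens.
        h = self.norms[0](x)
        x = x + self.self_attn(h, h, h)[0]
        x = x + self.emo_attn(self.norms[1](x), emo, emo)[0]  # emotion conditioning path
        x = x + self.spk_attn(self.norms[2](x), spk, spk)[0]  # speaker conditioning path
        return x + self.ffn(self.norms[3](x))


if __name__ == "__main__":
    block = DualConditioningBlock()
    x = torch.randn(2, 50, 256)
    emo, spk = torch.randn(2, 1, 256), torch.randn(2, 1, 256)
    print(block(x, emo, spk).shape)  # torch.Size([2, 50, 256])
```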
Business Value
Enables the creation of more expressive and versatile synthetic voices for applications like virtual assistants, character voices in games/animation, and personalized audio content, enhancing user engagement and accessibility.