
DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech

📄 Abstract

Cross-speaker emotion transfer in speech synthesis relies on extracting speaker-independent emotion embeddings for accurate emotion modeling without retaining speaker traits. However, existing timbre compression methods fail to fully separate speaker and emotion characteristics, causing speaker leakage and degraded synthesis quality. To address this, we propose DiEmo-TTS, a self-supervised distillation method that minimizes emotional information loss and preserves speaker identity. We introduce cluster-driven sampling and information perturbation to preserve emotion while removing irrelevant factors. To facilitate this process, we propose an emotion clustering and matching approach using emotional attribute prediction and speaker embeddings, enabling generalization to unlabeled data. Additionally, we design a dual conditioning transformer to better integrate style features. Experimental results confirm the effectiveness of our method in learning speaker-irrelevant emotion embeddings.
Authors (4)
Deok-Hyeon Cho
Hyung-Seok Oh
Seung-Bin Kim
Seong-Whan Lee
Submitted
May 26, 2025
arXiv Category
cs.SD
arXiv PDF

Key Contributions

Introduces DiEmo-TTS, a self-supervised distillation method for cross-speaker emotion transfer in TTS that effectively separates speaker and emotion characteristics. It uses cluster-driven sampling, information perturbation, and emotion clustering/matching to preserve emotion while removing speaker traits, enabling generalization to unlabeled data and better style integration via a dual conditioning transformer.
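The emotion clustering and matching step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's actual implementation: the function names, the two-dimensional attributes (e.g. valence/arousal), the farthest-point k-means initialization, and the nearest-centroid matching rule are all illustrative assumptions. The idea shown is that unlabeled emotion embeddings are first clustered, then each cluster is assigned an emotion category by comparing its mean predicted attributes against reference attribute centroids.

```python
import numpy as np


def _farthest_point_init(x, k):
    """Pick k well-spread starting centroids (deterministic, illustrative)."""
    centroids = [x[0]]
    for _ in range(k - 1):
        # distance of every point to its nearest already-chosen centroid
        d = np.min(np.linalg.norm(x[:, None] - np.array(centroids)[None], axis=-1), axis=1)
        centroids.append(x[d.argmax()])
    return np.array(centroids, dtype=float)


def cluster_and_match(embeddings, attributes, reference_centroids, n_clusters=3, n_iter=20):
    """Cluster emotion embeddings with plain k-means, then label each cluster
    with the emotion category whose reference attribute centroid is nearest
    to the cluster's mean predicted attributes (hypothetical sketch)."""
    centroids = _farthest_point_init(embeddings, n_clusters)
    for _ in range(n_iter):
        # assign each embedding to its nearest centroid
        d = np.linalg.norm(embeddings[:, None] - centroids[None], axis=-1)
        labels = d.argmin(axis=1)
        # update centroids as cluster means
        for k in range(n_clusters):
            if (labels == k).any():
                centroids[k] = embeddings[labels == k].mean(axis=0)
    # match each cluster to a named emotion via attribute-space distance
    matched = {}
    for k in range(n_clusters):
        attr_mean = attributes[labels == k].mean(axis=0)
        matched[k] = min(
            reference_centroids,
            key=lambda name: np.linalg.norm(attr_mean - reference_centroids[name]),
        )
    return labels, matched
```

Under this sketch, clusters found on unlabeled speech inherit emotion labels purely from predicted attributes, which is what allows the approach to generalize beyond emotion-annotated corpora.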

Business Value

Enables the creation of more expressive and versatile synthetic voices for applications like virtual assistants, character voices in games/animation, and personalized audio content, enhancing user engagement and accessibility.