
DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech

📄 Abstract

Cross-speaker emotion transfer in speech synthesis relies on extracting speaker-independent emotion embeddings for accurate emotion modeling without retaining speaker traits. However, existing timbre compression methods fail to fully separate speaker and emotion characteristics, causing speaker leakage and degraded synthesis quality. To address this, we propose DiEmo-TTS, a self-supervised distillation method that minimizes emotional information loss and preserves speaker identity. We introduce cluster-driven sampling and information perturbation to preserve emotion while removing irrelevant factors. To facilitate this process, we propose an emotion clustering and matching approach using emotional attribute prediction and speaker embeddings, enabling generalization to unlabeled data. Additionally, we design a dual conditioning transformer to better integrate style features. Experimental results confirm the effectiveness of our method in learning speaker-irrelevant emotion embeddings.
Authors (4)
Deok-Hyeon Cho
Hyung-Seok Oh
Seung-Bin Kim
Seong-Whan Lee
Submitted
May 26, 2025
arXiv Category
cs.SD
arXiv PDF

Key Contributions

Introduces DiEmo-TTS, a self-supervised distillation method for cross-speaker emotion transfer in TTS that effectively separates speaker and emotion characteristics. It uses cluster-driven sampling, information perturbation, and emotion clustering/matching to preserve emotion while removing speaker traits, enabling generalization to unlabeled data and better style integration via a dual conditioning transformer.
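The emotion clustering and matching step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's actual implementation: the function names, the two-dimensional attributes (e.g. valence/arousal), the farthest-point k-means initialization, and the nearest-centroid matching rule are all illustrative assumptions. The idea shown is that unlabeled emotion embeddings are first clustered, then each cluster is assigned an emotion category by comparing its mean predicted attributes against reference attribute centroids.

```python
import numpy as np


def _farthest_point_init(x, k):
    """Pick k well-spread starting centroids (deterministic, illustrative)."""
    centroids = [x[0]]
    for _ in range(k - 1):
        # distance of every point to its nearest already-chosen centroid
        d = np.min(np.linalg.norm(x[:, None] - np.array(centroids)[None], axis=-1), axis=1)
        centroids.append(x[d.argmax()])
    return np.array(centroids, dtype=float)


def cluster_and_match(embeddings, attributes, reference_centroids, n_clusters=3, n_iter=20):
    """Cluster emotion embeddings with plain k-means, then label each cluster
    with the emotion category whose reference attribute centroid is nearest
    to the cluster's mean predicted attributes (hypothetical sketch)."""
    centroids = _farthest_point_init(embeddings, n_clusters)
    for _ in range(n_iter):
        # assign each embedding to its nearest centroid
        d = np.linalg.norm(embeddings[:, None] - centroids[None], axis=-1)
        labels = d.argmin(axis=1)
        # update centroids as cluster means
        for k in range(n_clusters):
            if (labels == k).any():
                centroids[k] = embeddings[labels == k].mean(axis=0)
    # match each cluster to a named emotion via attribute-space distance
    matched = {}
    for k in range(n_clusters):
        attr_mean = attributes[labels == k].mean(axis=0)
        matched[k] = min(
            reference_centroids,
            key=lambda name: np.linalg.norm(attr_mean - reference_centroids[name]),
        )
    return labels, matched
```

Under this sketch, clusters found on unlabeled speech inherit emotion labels purely from predicted attributes, which is what allows the approach to generalize beyond emotion-annotated corpora.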

Business Value

Enables the creation of more expressive and versatile synthetic voices for applications like virtual assistants, character voices in games/animation, and personalized audio content, enhancing user engagement and accessibility.