
EmoSphere-SER: Enhancing Speech Emotion Recognition Through Spherical Representation with Auxiliary Classification

📄 Abstract

Speech emotion recognition predicts a speaker's emotional state from speech signals using discrete labels or continuous dimensions such as arousal, valence, and dominance (VAD). We propose EmoSphere-SER, a joint model that integrates spherical VAD region classification to guide VAD regression for improved emotion prediction. In our framework, VAD values are transformed into spherical coordinates that are divided into multiple spherical regions, and an auxiliary classification task predicts which spherical region each point belongs to, guiding the regression process. Additionally, we incorporate a dynamic weighting scheme and a style pooling layer with multi-head self-attention to capture spectral and temporal dynamics, further boosting performance. This combined training strategy reinforces structured learning and improves prediction consistency. Experimental results show that our approach exceeds baseline methods, confirming the validity of the proposed framework.
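
The spherical transformation and region assignment described in the abstract can be illustrated with a small sketch. The abstract does not give the exact formulation, so everything below is an assumption for illustration: the function names `vad_to_spherical` and `spherical_region` are hypothetical, the three input columns are treated as generic Cartesian axes in whatever order the dataset provides, the `center` point is assumed to be supplied by the caller (for example, the midpoint of the label scale), and the partition into 2 polar bins by 4 azimuth bins is just one possible way to form regions.

```python
import numpy as np

def vad_to_spherical(vad, center):
    """Map VAD points of shape (N, 3) to (r, theta, phi) around `center`.

    Assumption: the three columns are used as Cartesian x, y, z axes.
    """
    v = np.asarray(vad, dtype=float) - np.asarray(center, dtype=float)
    x, y, z = v[:, 0], v[:, 1], v[:, 2]
    r = np.linalg.norm(v, axis=1)
    # Polar angle in [0, pi]; points at the center (r == 0) get theta = pi/2.
    cos_theta = np.divide(z, r, out=np.zeros_like(r), where=r > 0)
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    # Azimuth in [-pi, pi].
    phi = np.arctan2(y, x)
    return np.stack([r, theta, phi], axis=1)

def spherical_region(sph, n_theta=2, n_phi=4):
    """Assign each point a region index from its angles only.

    Hypothetical partition: n_theta polar bins times n_phi azimuth bins.
    """
    theta, phi = sph[:, 1], sph[:, 2]
    t_bin = np.minimum((theta / np.pi * n_theta).astype(int), n_theta - 1)
    p_bin = np.minimum(((phi + np.pi) / (2 * np.pi) * n_phi).astype(int), n_phi - 1)
    return t_bin * n_phi + p_bin

# Two example points on opposite sides of an assumed center at the origin.
points = np.array([[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]])
sph = vad_to_spherical(points, center=[0.0, 0.0, 0.0])
regions = spherical_region(sph)
```

In a setup like this, the region indices would serve as targets for the auxiliary classifier, while the original VAD values remain the regression targets.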
Authors (4)
Deok-Hyeon Cho
Hyung-Seok Oh
Seung-Bin Kim
Seong-Whan Lee
Submitted
May 26, 2025
arXiv Category
cs.SD

Key Contributions

Proposes EmoSphere-SER, a joint model for speech emotion recognition that improves VAD regression by adding spherical VAD region classification as an auxiliary task. The auxiliary task guides the regression process, which the authors report improves prediction consistency and performance over baseline methods. The model also uses a dynamic weighting scheme and a style pooling layer with multi-head self-attention to capture spectral and temporal dynamics.
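
To make the idea of a joint objective concrete, here is a minimal sketch of combining a regression loss with an auxiliary classification loss. This is not the paper's loss: the summary does not say which regression loss is used or how the dynamic weighting is computed. The sketch assumes a loss based on the concordance correlation coefficient (CCC), which is common for VAD regression, plus cross-entropy over region labels, and leaves the weighting as a caller-supplied `aux_weight` that a training loop could change over time. All function names are hypothetical.

```python
import numpy as np

def ccc(pred, target):
    """Concordance correlation coefficient between two 1-D arrays."""
    pm, tm = pred.mean(), target.mean()
    cov = ((pred - pm) * (target - tm)).mean()
    return 2 * cov / (pred.var() + target.var() + (pm - tm) ** 2)

def cross_entropy(logits, labels):
    """Mean cross-entropy from raw logits (N, C) and integer labels (N,)."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def joint_loss(vad_pred, vad_true, region_logits, region_labels, aux_weight):
    """Regression loss (1 - CCC, averaged over dimensions) plus weighted auxiliary loss.

    Assumption: `aux_weight` stands in for the dynamic weighting, whose
    actual form is not described in this summary.
    """
    n_dims = vad_true.shape[1]
    reg = np.mean([1.0 - ccc(vad_pred[:, k], vad_true[:, k]) for k in range(n_dims)])
    cls = cross_entropy(region_logits, region_labels)
    return reg + aux_weight * cls

# Toy batch: perfect regression, uninformative (uniform) region logits over 8 regions.
vad_true = np.array([[0.1, 0.2, 0.3], [0.5, 0.1, 0.9], [0.7, 0.8, 0.2]])
region_logits = np.zeros((3, 8))
region_labels = np.array([0, 3, 7])
loss = joint_loss(vad_true.copy(), vad_true, region_logits, region_labels, aux_weight=0.5)
```

With perfect regression the first term vanishes, so the toy loss reduces to the weighted cross-entropy of a uniform guess over eight regions.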

Business Value

Enables more accurate understanding of customer emotions in call centers, improves the expressiveness of virtual agents, and can be used for monitoring user sentiment in various interactive applications.