Abstract
Speech emotion recognition predicts a speaker's emotional state from speech
signals using discrete labels or continuous dimensions such as arousal,
valence, and dominance (VAD). We propose EmoSphere-SER, a joint model that
integrates spherical VAD region classification to guide VAD regression for
improved emotion prediction. In our framework, VAD values are transformed into
spherical coordinates that are divided into multiple spherical regions, and an
auxiliary classification task predicts which spherical region each point
belongs to, guiding the regression process. Additionally, we incorporate a
dynamic weighting scheme and a style pooling layer with multi-head
self-attention to capture spectral and temporal dynamics, further boosting
performance. This combined training strategy reinforces structured learning and
improves prediction consistency. Experimental results show that our approach
outperforms baseline methods, confirming the effectiveness of the proposed framework.
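To make the auxiliary task concrete, the sketch below converts a VAD point into spherical coordinates around a neutral center and quantizes the angles into a region label. This is a minimal illustration: the choice of center, the angle bin counts, and the resulting eight regions are assumptions for exposition, not the partition defined in the paper.

```python
import numpy as np

def vad_to_spherical(vad, center=np.array([0.5, 0.5, 0.5])):
    """Map a VAD point (valence, arousal, dominance) to spherical
    coordinates (r, theta, phi) around a neutral center.
    The center value is an assumption, not taken from the paper."""
    x, y, z = vad - center                      # shift so "neutral" is the origin
    r = np.sqrt(x**2 + y**2 + z**2)             # radius: emotion intensity
    theta = np.arccos(z / r) if r > 0 else 0.0  # polar angle in [0, pi]
    phi = np.arctan2(y, x) % (2 * np.pi)        # azimuth in [0, 2*pi)
    return r, theta, phi

def spherical_region(theta, phi, n_theta=2, n_phi=4):
    """Quantize the angles into one of n_theta * n_phi spherical regions.
    The defaults give 8 octant-like regions; the granularity here is
    an illustrative assumption."""
    t_bin = min(int(theta / (np.pi / n_theta)), n_theta - 1)
    p_bin = int(phi / (2 * np.pi / n_phi)) % n_phi
    return t_bin * n_phi + p_bin                # region index for classification

# Example: a high-arousal, positive-valence point
r, theta, phi = vad_to_spherical(np.array([0.8, 0.9, 0.6]))
print(f"r={r:.3f}, region={spherical_region(theta, phi)}")
```

The region index would serve as the target for the auxiliary classification head, while the VAD values remain the regression targets that the classification signal guides.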
Authors
Deok-Hyeon Cho
Hyung-Seok Oh
Seung-Bin Kim
Seong-Whan Lee
Key Contributions
Proposes EmoSphere-SER, a novel joint model for speech emotion recognition that enhances VAD regression by incorporating spherical VAD region classification as an auxiliary task. This guides the regression process, improving prediction consistency and accuracy. It also uses dynamic weighting and style pooling with multi-head self-attention to capture temporal and spectral dynamics.
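As a rough illustration of how a dynamic weighting scheme can combine the regression and classification objectives, the sketch below uses learnable uncertainty-based task weights (Kendall et al., 2018). This particular weighting rule and the MSE regression term are assumptions standing in for the paper's exact formulation.

```python
import torch
import torch.nn as nn

class JointVADLoss(nn.Module):
    """Hypothetical joint objective: VAD regression plus an auxiliary
    spherical-region classification term with dynamic task weights.
    The uncertainty-based weighting is one common multi-task scheme,
    not necessarily the paper's exact rule."""
    def __init__(self):
        super().__init__()
        self.reg_loss = nn.MSELoss()
        self.cls_loss = nn.CrossEntropyLoss()
        # Learnable log-variances act as dynamic task weights.
        self.log_var_reg = nn.Parameter(torch.zeros(1))
        self.log_var_cls = nn.Parameter(torch.zeros(1))

    def forward(self, vad_pred, vad_true, region_logits, region_true):
        l_reg = self.reg_loss(vad_pred, vad_true)
        l_cls = self.cls_loss(region_logits, region_true)
        # Each task is scaled by its learned precision; the additive
        # log-variance terms keep the weights from collapsing to zero.
        return (torch.exp(-self.log_var_reg) * l_reg + self.log_var_reg
                + torch.exp(-self.log_var_cls) * l_cls + self.log_var_cls)
```

In training, such a loss would be applied to the outputs of a shared encoder's regression and classification heads, letting the balance between the two tasks adapt over the course of optimization.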
Business Value
Enables more accurate understanding of customer emotions in call centers, improves the expressiveness of virtual agents, and can be used for monitoring user sentiment in various interactive applications.