Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech

πŸ“„ Abstract

Advancements in spoken language processing have driven the development of spoken language models (SLMs), designed to achieve universal audio understanding by jointly learning text and audio representations for a wide range of tasks. Although promising results have been achieved, there is growing discussion regarding these models' generalization capabilities and the extent to which they truly integrate audio and text modalities in their internal representations. In this work, we evaluate four SLMs on the task of speech emotion recognition using a dataset of emotionally incongruent speech samples, a condition under which the semantic content of the spoken utterance conveys one emotion while speech expressiveness conveys another. Our results indicate that SLMs rely predominantly on textual semantics rather than speech emotion to perform the task, suggesting that text-related representations largely dominate over acoustic representations. We release both the code and the Emotionally Incongruent Synthetic Speech dataset (EMIS) to the community.
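
The incongruent-evaluation setup can be illustrated with a short sketch. The snippet below is a hypothetical example, not the authors' released code: it assumes each EMIS-style sample pairs a transcript carrying one emotion with synthetic audio expressing a different one, and a placeholder `predict_emotion` function standing in for any SLM. It then measures how often the model's prediction follows the text emotion versus the speech emotion.

```python
# Hypothetical sketch of evaluating an SLM on emotionally incongruent speech.
# `predict_emotion` is a stand-in for any SLM inference call; the sample fields
# mirror the EMIS idea (text emotion != speech emotion) but are invented here.

def predict_emotion(audio_path: str, transcript: str) -> str:
    """Placeholder for a real SLM call (e.g., prompting the model to name the speaker's emotion)."""
    raise NotImplementedError("Plug in the SLM of interest here.")

samples = [
    {"audio": "happy_text_angry_voice.wav",
     "transcript": "I just got the best news of my life!",
     "text_emotion": "happy", "speech_emotion": "angry"},
    {"audio": "sad_text_happy_voice.wav",
     "transcript": "Everything I worked for is gone.",
     "text_emotion": "sad", "speech_emotion": "happy"},
]

def evaluate(samples):
    follows_text = follows_speech = 0
    for s in samples:
        pred = predict_emotion(s["audio"], s["transcript"]).lower()
        follows_text += pred == s["text_emotion"]
        follows_speech += pred == s["speech_emotion"]
    n = len(samples)
    print(f"Agrees with text emotion:   {follows_text / n:.2%}")
    print(f"Agrees with speech emotion: {follows_speech / n:.2%}")

# evaluate(samples)  # run once predict_emotion is wired to an actual SLM
```

Under the paper's finding, a text-dominant SLM would score high on the first rate and near chance on the second.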
Authors (5)
Pedro CorrΓͺa
JoΓ£o Lima
Victor Moreno
Lucas Ueda
Paula Dornhofer Paro Costa
Submitted
October 29, 2025
arXiv Category
cs.CL

Key Contributions

The paper evaluates four spoken language models (SLMs) on speech emotion recognition using emotionally incongruent speech. The study reveals that SLMs predominantly rely on textual semantics rather than acoustic cues for emotion recognition, indicating that text-related representations dominate over acoustic ones.

Business Value

Provides critical insights into the limitations of current spoken language models for understanding nuanced human emotions, guiding the development of more robust and truly multimodal AI systems for applications like customer service or mental health monitoring.