Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech

πŸ“„ Abstract

Advancements in spoken language processing have driven the development of spoken language models (SLMs), designed to achieve universal audio understanding by jointly learning text and audio representations for a wide range of tasks. Although promising results have been achieved, there is growing discussion regarding these models' generalization capabilities and the extent to which they truly integrate audio and text modalities in their internal representations. In this work, we evaluate four SLMs on the task of speech emotion recognition using a dataset of emotionally incongruent speech samples, a condition under which the semantic content of the spoken utterance conveys one emotion while speech expressiveness conveys another. Our results indicate that SLMs rely predominantly on textual semantics rather than speech emotion to perform the task, suggesting that text-related representations largely dominate over acoustic representations. We release both the code and the Emotionally Incongruent Synthetic Speech dataset (EMIS) to the community.
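
The incongruent-evaluation setup can be illustrated with a short sketch. The snippet below is a hypothetical example, not the authors' released code: it assumes each EMIS-style sample pairs a transcript carrying one emotion with synthetic audio expressing a different one, and a placeholder `predict_emotion` function standing in for any SLM. It then measures how often the model's prediction follows the text emotion versus the speech emotion.

```python
# Hypothetical sketch of evaluating an SLM on emotionally incongruent speech.
# `predict_emotion` is a stand-in for any SLM inference call; the sample fields
# mirror the EMIS idea (text emotion != speech emotion) but are invented here.

def predict_emotion(audio_path: str, transcript: str) -> str:
    """Placeholder for a real SLM call (e.g., prompting the model to name the speaker's emotion)."""
    raise NotImplementedError("Plug in the SLM of interest here.")

samples = [
    {"audio": "happy_text_angry_voice.wav",
     "transcript": "I just got the best news of my life!",
     "text_emotion": "happy", "speech_emotion": "angry"},
    {"audio": "sad_text_happy_voice.wav",
     "transcript": "Everything I worked for is gone.",
     "text_emotion": "sad", "speech_emotion": "happy"},
]

def evaluate(samples):
    follows_text = follows_speech = 0
    for s in samples:
        pred = predict_emotion(s["audio"], s["transcript"]).lower()
        follows_text += pred == s["text_emotion"]
        follows_speech += pred == s["speech_emotion"]
    n = len(samples)
    print(f"Agrees with text emotion:   {follows_text / n:.2%}")
    print(f"Agrees with speech emotion: {follows_speech / n:.2%}")

# evaluate(samples)  # run once predict_emotion is wired to an actual SLM
```

Under the paper's finding, a text-dominant SLM would score high on the first rate and near chance on the second.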
Authors (5)
Pedro CorrΓͺa
JoΓ£o Lima
Victor Moreno
Lucas Ueda
Paula Dornhofer Paro Costa
Submitted
October 29, 2025
arXiv Category
cs.CL

Key Contributions

The paper evaluates four spoken language models (SLMs) on speech emotion recognition using emotionally incongruent speech. The study reveals that SLMs predominantly rely on textual semantics rather than acoustic cues for emotion recognition, indicating that text-related representations dominate over acoustic ones.

Business Value

Provides critical insights into the limitations of current spoken language models for understanding nuanced human emotions, guiding the development of more robust and truly multimodal AI systems for applications like customer service or mental health monitoring.