Abstract
Emotion recognition from speech plays a vital role in the development of
empathetic human-computer interaction systems. This paper presents a
comparative analysis of two lightweight transformer-based models, DistilHuBERT
and PaSST, on the task of classifying six core emotions from the CREMA-D
dataset. We benchmark their performance against a traditional CNN-LSTM
baseline that uses MFCC features. DistilHuBERT demonstrates superior accuracy
(70.64%) and F1 score (70.36%) while maintaining an exceptionally small model
size (0.02 MB), outperforming both PaSST and the baseline. Furthermore, we
conduct an ablation study on three variants of PaSST's classification head
(Linear, MLP, and Attentive Pooling) to understand the effect of head
architecture on model performance. Our results indicate that PaSST with an
MLP head yields the best performance among its variants but still falls short
of DistilHuBERT.
Among the emotion classes, angry is consistently the most accurately detected,
while disgust remains the most challenging. These findings suggest that
lightweight transformers like DistilHuBERT offer a compelling solution for
real-time speech emotion recognition on edge devices. The code is available at:
https://github.com/luckymaduabuchi/Emotion-detection-.
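As a concrete illustration of the DistilHuBERT pipeline the abstract describes, the sketch below assembles an audio-classification model with the Hugging Face transformers library. The checkpoint name (ntu-spml/distilhubert), the label ordering, and the dummy input are illustrative assumptions, not the authors' released code; see the repository linked above for their implementation.

```python
# Minimal sketch (assumptions noted above): DistilHuBERT with a freshly
# initialized classification head for the six CREMA-D emotions.
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

LABELS = ["angry", "disgust", "fear", "happy", "neutral", "sad"]  # CREMA-D's six emotions

extractor = AutoFeatureExtractor.from_pretrained("ntu-spml/distilhubert")
model = AutoModelForAudioClassification.from_pretrained(
    "ntu-spml/distilhubert",
    num_labels=len(LABELS),
    label2id={l: i for i, l in enumerate(LABELS)},
    id2label={i: l for i, l in enumerate(LABELS)},
)

# A dummy 3-second clip at 16 kHz stands in for a CREMA-D utterance;
# in practice the head would be fine-tuned on labeled speech first.
waveform = torch.randn(16000 * 3)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, 6)
print(LABELS[logits.argmax(-1).item()])
```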
Key Contributions
Compares lightweight transformer models (DistilHuBERT, PaSST) for speech emotion recognition, finding DistilHuBERT superior in accuracy and F1 score with minimal model size. It also performs an ablation study on PaSST's classification heads.
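The three head variants in that ablation can be pictured as follows. This is a hedged sketch: the layer widths, hidden size, and the exact attentive-pooling formulation are assumptions for illustration, not the paper's implementation.

```python
# Illustrative versions of the Linear, MLP, and Attentive Pooling heads that
# sit on top of frame-level transformer embeddings (sizes are assumptions).
import torch
import torch.nn as nn

class LinearHead(nn.Module):
    """Mean-pool frame embeddings, then a single linear layer."""
    def __init__(self, dim: int, n_classes: int = 6):
        super().__init__()
        self.fc = nn.Linear(dim, n_classes)

    def forward(self, x):              # x: (batch, frames, dim)
        return self.fc(x.mean(dim=1))

class MLPHead(nn.Module):
    """Mean-pool, then a small two-layer MLP."""
    def __init__(self, dim: int, hidden: int = 256, n_classes: int = 6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, n_classes)
        )

    def forward(self, x):
        return self.net(x.mean(dim=1))

class AttentivePoolingHead(nn.Module):
    """Replace uniform mean pooling with learned per-frame weights."""
    def __init__(self, dim: int, n_classes: int = 6):
        super().__init__()
        self.attn = nn.Linear(dim, 1)
        self.fc = nn.Linear(dim, n_classes)

    def forward(self, x):
        w = torch.softmax(self.attn(x), dim=1)   # (batch, frames, 1)
        return self.fc((w * x).sum(dim=1))

frames = torch.randn(2, 100, 768)  # dummy frame embeddings (batch, frames, dim)
for head in (LinearHead(768), MLPHead(768), AttentivePoolingHead(768)):
    print(head.__class__.__name__, head(frames).shape)  # -> (2, 6) logits
```

The usual motivation for the attentive variant is that emotional cues are unevenly distributed across an utterance, so letting the head weight informative frames can outperform uniform averaging.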
Business Value
Enables the development of more empathetic and responsive AI systems (e.g., virtual assistants, customer service bots) by accurately detecting user emotions from speech.