Do Slides Help? Multi-modal Context for Automatic Transcription of Conference Talks

Abstract

State-of-the-art (SOTA) Automatic Speech Recognition (ASR) systems primarily rely on acoustic information while disregarding additional multi-modal context. However, visual information is essential for disambiguation and adaptation. While most work focuses on speaker images to handle noise conditions, this work also focuses on integrating presentation slides for the use case of scientific presentations. In a first step, we create a benchmark for multi-modal presentations, including an automatic analysis of how well domain-specific terminology is transcribed. Next, we explore methods for augmenting speech models with multi-modal information. We mitigate the lack of datasets with accompanying slides through a suitable data augmentation approach. Finally, we train a model using the augmented dataset, resulting in a relative reduction in word error rate of approximately 34% across all words and 35% for domain-specific terms compared to the baseline model.
Authors: Supriti Sinhamahapatra, Jan Niehues
Submitted: October 15, 2025
arXiv Category: cs.AI

Key Contributions

This work introduces a multi-modal approach to Automatic Speech Recognition (ASR) for conference talks that integrates presentation slides with acoustic information. It contributes a benchmark for multi-modal presentations, addresses the lack of datasets with accompanying slides through data augmentation, and reports a relative reduction in word error rate (WER) of approximately 34% across all words and 35% for domain-specific terms compared to the baseline model.
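For readers unfamiliar with the metric, the following is a minimal sketch of how word error rate and a relative WER reduction are commonly computed. It is not the authors' evaluation code: the function names are hypothetical, the toy sentences are invented for illustration, and no text normalization (casing, punctuation) is applied.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Toy example (invented sentences, not data from the paper).
reference = "we fine tune the transformer encoder"
baseline_output = "we find tune the transform encoder"  # 2 substitutions
improved_output = "we fine tune the transform encoder"  # 1 substitution

baseline_wer = word_error_rate(reference, baseline_output)
improved_wer = word_error_rate(reference, improved_output)

print(f"baseline WER: {baseline_wer:.3f}")  # 0.333
print(f"improved WER: {improved_wer:.3f}")  # 0.167
print(f"relative reduction: {relative_reduction(baseline_wer, improved_wer):.3f}")  # 0.500
```

A relative reduction is measured against the baseline's own error rate, so a figure of roughly 34% means about one third of the baseline's errors were removed, not that WER dropped by 34 percentage points.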

Business Value

Automates the creation of searchable transcripts for lectures and presentations, enhancing accessibility, discoverability, and reusability of educational and scientific content.