Abstract
State-of-the-art (SOTA) Automatic Speech Recognition (ASR) systems primarily
rely on acoustic information while disregarding additional multi-modal context.
However, visual information is essential for disambiguation and adaptation.
While most prior work focuses on speaker images to handle noisy conditions,
this work also integrates presentation slides for the use case of scientific
presentations.
As a first step, we create a benchmark for multi-modal presentations,
including an automatic analysis of how domain-specific terminology is
transcribed. Next, we explore methods for augmenting speech models with
multi-modal information. We mitigate the lack of datasets with accompanying
slides through a suitable data augmentation approach. Finally, we train a
model on the augmented dataset, resulting in a relative reduction in word
error rate of approximately 34% across all words and 35% for domain-specific
terms compared to the baseline model.
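To make the reported metric concrete, the sketch below shows how word error rate (WER) and a relative WER reduction are typically computed. This is a generic illustration, not code from the paper; the example numbers at the end are hypothetical and chosen only to demonstrate the formula.

```python
# Minimal sketch (not from the paper): WER via Levenshtein distance over
# word sequences, plus the relative-reduction formula behind statements
# such as "~34% relative WER reduction".

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


def relative_reduction(wer_baseline: float, wer_system: float) -> float:
    return (wer_baseline - wer_system) / wer_baseline


if __name__ == "__main__":
    # Hypothetical values purely for illustration (not the paper's results):
    print(relative_reduction(0.20, 0.13))  # 0.35, i.e. a 35% relative reduction
```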
Authors (2)
Supriti Sinhamahapatra
Jan Niehues
Submitted
October 15, 2025
Key Contributions
This work introduces a multi-modal approach to Automatic Speech Recognition (ASR) for conference talks by integrating presentation slides with acoustic information. It addresses the lack of multi-modal datasets through data augmentation and demonstrates significant reductions in word error rate (WER), particularly for domain-specific terms, improving the accuracy of transcribing scientific presentations.
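As a rough illustration of the general idea of slide-conditioned ASR (not necessarily the authors' method), slide text can be supplied as a decoding prompt to an off-the-shelf speech model. The sketch below uses the openai-whisper library's `initial_prompt` argument; the audio path and the slide-term string are hypothetical placeholders, and any OCR step for extracting slide text is assumed rather than shown.

```python
# Hedged sketch: bias a generic ASR model toward domain-specific terms
# taken from the current presentation slide.
import whisper

model = whisper.load_model("base")

# Hypothetical slide-derived terminology (e.g. obtained via OCR on the slide).
slide_terms = "Transformer, byte-pair encoding, word error rate"

# Passing the terms as an initial prompt nudges decoding toward them.
result = model.transcribe("talk_segment.wav", initial_prompt=slide_terms)
print(result["text"])
```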
Business Value
Automates the creation of searchable transcripts for lectures and presentations, enhancing accessibility, discoverability, and reusability of educational and scientific content.