Abstract
State-of-the-art (SOTA) Automatic Speech Recognition (ASR) systems primarily
rely on acoustic information while disregarding additional multi-modal context.
However, visual information is essential for disambiguation and adaptation.
While most prior work focuses on speaker images to handle noisy conditions,
this work also integrates presentation slides for the use case of scientific
presentations.
As a first step, we create a benchmark for multi-modal presentations,
including an automatic analysis of how domain-specific terminology is
transcribed. Next, we explore methods for augmenting speech models with
multi-modal information. We mitigate the lack of datasets with accompanying
slides through a suitable data augmentation approach. Finally, we train a
model on the augmented dataset, resulting in a relative reduction in word
error rate of approximately 34% across all words and 35% for domain-specific
terms compared to the baseline model.
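To make the reported metric concrete, the sketch below shows how word error rate (WER) and a relative WER reduction are typically computed. This is a generic illustration, not code from the paper; the example numbers at the end are hypothetical and chosen only to demonstrate the formula.

```python
# Minimal sketch (not from the paper): WER via Levenshtein distance over
# word sequences, plus the relative-reduction formula behind statements
# such as "~34% relative WER reduction".

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


def relative_reduction(wer_baseline: float, wer_system: float) -> float:
    return (wer_baseline - wer_system) / wer_baseline


if __name__ == "__main__":
    # Hypothetical values purely for illustration (not the paper's results):
    print(relative_reduction(0.20, 0.13))  # 0.35, i.e. a 35% relative reduction
```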
Authors (2)
Supriti Sinhamahapatra
Jan Niehues
Submitted
October 15, 2025
Key Contributions
This work introduces a multi-modal approach to Automatic Speech Recognition (ASR) for conference talks by integrating presentation slides with acoustic information. It addresses the lack of multi-modal datasets through data augmentation and demonstrates significant reductions in word error rate (WER), particularly for domain-specific terms, improving the accuracy of transcribing scientific presentations.
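As a rough illustration of the general idea of slide-conditioned ASR (not necessarily the authors' method), slide text can be supplied as a decoding prompt to an off-the-shelf speech model. The sketch below uses the openai-whisper library's `initial_prompt` argument; the audio path and the slide-term string are hypothetical placeholders, and any OCR step for extracting slide text is assumed rather than shown.

```python
# Hedged sketch: bias a generic ASR model toward domain-specific terms
# taken from the current presentation slide.
import whisper

model = whisper.load_model("base")

# Hypothetical slide-derived terminology (e.g. obtained via OCR on the slide).
slide_terms = "Transformer, byte-pair encoding, word error rate"

# Passing the terms as an initial prompt nudges decoding toward them.
result = model.transcribe("talk_segment.wav", initial_prompt=slide_terms)
print(result["text"])
```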
Business Value
Automates the creation of searchable transcripts for lectures and presentations, enhancing accessibility, discoverability, and reusability of educational and scientific content.