Two Heads Are Better Than One: Audio-Visual Speech Error Correction with Dual Hypotheses

Abstract

This paper introduces a new paradigm for generative error correction (GER) in audio-visual speech recognition (AVSR) that reasons over modality-specific evidence directly in the language space. Our framework, DualHyp, empowers a large language model (LLM) to compose independent N-best hypotheses from separate automatic speech recognition (ASR) and visual speech recognition (VSR) models. To maximize the effectiveness of DualHyp, we further introduce RelPrompt, a noise-aware guidance mechanism that provides modality-grounded prompts to the LLM. RelPrompt supplies the temporal reliability of each modality stream, guiding the model to dynamically switch its focus between ASR and VSR hypotheses for accurate correction. Under various corruption scenarios, our framework attains up to a 57.7% error rate gain over a standard ASR baseline on the LRS2 benchmark, whereas single-stream GER approaches achieve only a 10% gain. To facilitate research within the DualHyp framework, we release the code and a dataset comprising ASR and VSR hypotheses at https://github.com/sungnyun/dualhyp.
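
To make the dual-hypothesis idea concrete, the following is a minimal sketch of how two N-best lists and per-segment reliability labels might be assembled into a single text prompt for an LLM. It is an illustration only, not the released DualHyp code: the function names (`format_nbest`, `format_reliability`, `build_dual_prompt`), the prompt wording, and the "clean"/"noisy" labels are all assumptions made for this example.

```python
from typing import Sequence


def format_nbest(tag: str, hyps: Sequence[str]) -> str:
    """Number each hypothesis and label it with its source modality."""
    return "\n".join(f"{tag} {i}: {h}" for i, h in enumerate(hyps, start=1))


def format_reliability(tag: str, segments: Sequence[str]) -> str:
    """Render hypothetical per-segment reliability labels for one stream."""
    return f"{tag} reliability by segment: " + " | ".join(segments)


def build_dual_prompt(
    asr_hyps: Sequence[str],
    vsr_hyps: Sequence[str],
    asr_reliability: Sequence[str],
    vsr_reliability: Sequence[str],
) -> str:
    """Combine both hypothesis lists and reliability hints into one prompt.

    The instruction text is a placeholder, not the prompt used in the paper.
    """
    parts = [
        "Two recognizers transcribed the same utterance.",
        "Use the reliability hints to decide which stream to trust.",
        format_nbest("ASR", asr_hyps),
        format_reliability("ASR", asr_reliability),
        format_nbest("VSR", vsr_hyps),
        format_reliability("VSR", vsr_reliability),
        "Corrected transcription:",
    ]
    return "\n".join(parts)


if __name__ == "__main__":
    print(
        build_dual_prompt(
            asr_hyps=["the cat sat", "the cat sad"],
            vsr_hyps=["a cat sat"],
            asr_reliability=["clean", "noisy"],
            vsr_reliability=["clean", "clean"],
        )
    )
```

The resulting string would then be passed to whichever LLM performs the correction. Keeping the two streams in separately labeled lists, rather than merging them, is what lets the model weigh each modality against its reliability hint.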

Key Contributions

Introduces DualHyp, a generative error correction framework for AVSR in which an LLM composes independent N-best hypotheses from separate ASR and VSR models. It also proposes RelPrompt, a noise-aware guidance mechanism that tells the LLM how reliable each modality is over time so that it can shift focus between ASR and VSR hypotheses. On the LRS2 benchmark, the framework reports up to a 57.7% error rate gain over a standard ASR baseline, compared with only 10% for single-stream GER approaches.
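
For readers unfamiliar with how such gains are usually measured, here is a small sketch that computes word error rate (WER) and a relative gain over a baseline. It assumes that "error rate gain" means relative WER reduction, which the summary does not state explicitly. The helper names `wer` and `relative_gain` are hypothetical and the numbers in the usage lines are made up for illustration, not taken from the paper.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(
                min(
                    prev[j] + 1,  # deletion
                    cur[j - 1] + 1,  # insertion
                    prev[j - 1] + (ref_word != hyp_word),  # substitution or match
                )
            )
        prev = cur
    return prev[-1] / len(ref)


def relative_gain(baseline_wer: float, corrected_wer: float) -> float:
    """Relative WER reduction in percent (assumed meaning of 'gain')."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - corrected_wer) / baseline_wer


if __name__ == "__main__":
    # Illustrative values only.
    print(wer("the cat sat on the mat", "the cat sad on mat"))
    print(relative_gain(20.0, 10.0))
```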

Business Value

Improves the accuracy and robustness of speech recognition systems, especially in challenging acoustic environments, which can lead to better user experiences in voice assistants, transcription services, and communication tools.