📄 Abstract
This paper introduces a new generative error correction (GER) paradigm for audio-visual speech recognition (AVSR) that reasons over modality-specific evidence directly in the language space. Our framework, DualHyp, empowers a large language model (LLM) to compose independent N-best hypotheses from separate automatic speech recognition (ASR) and visual speech recognition (VSR) models. To maximize the effectiveness of DualHyp, we further introduce RelPrompt, a noise-aware guidance mechanism that provides modality-grounded prompts to the LLM. RelPrompt conveys the temporal reliability of each modality stream, guiding the model to dynamically switch its focus between ASR and VSR hypotheses for accurate correction. Under various corruption scenarios, our framework attains up to a 57.7% error rate gain on the LRS2 benchmark over a standard ASR baseline, whereas single-stream GER approaches achieve only a 10% gain. To facilitate research within our DualHyp framework, we release the code and a dataset comprising ASR and VSR hypotheses at https://github.com/sungnyun/dualhyp.
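
To make the idea concrete, below is a minimal, hypothetical sketch (not the released DualHyp/RelPrompt code) of how a correction prompt might compose ASR and VSR N-best hypotheses together with coarse per-segment reliability tags, so that an LLM can shift attention between modalities over time. The names `ModalityHypotheses` and `build_dual_prompt`, the HIGH/LOW tagging, and the 0.5 threshold are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch: composing ASR and VSR N-best hypotheses with
# per-segment reliability tags into a single correction prompt for an LLM.
from dataclasses import dataclass
from typing import List


@dataclass
class ModalityHypotheses:
    name: str                 # "ASR" or "VSR"
    nbest: List[str]          # N-best hypotheses from that recognizer
    reliability: List[float]  # assumed per-segment reliability scores in [0, 1]


def build_dual_prompt(asr: ModalityHypotheses, vsr: ModalityHypotheses) -> str:
    """Compose a correction prompt from both modality streams.

    Reliability scores are rendered as coarse HIGH/LOW tags so the LLM can
    switch its focus between modalities segment by segment, in the spirit of
    the RelPrompt guidance described in the abstract.
    """
    def fmt(m: ModalityHypotheses) -> str:
        tags = " ".join("HIGH" if r >= 0.5 else "LOW" for r in m.reliability)
        hyps = "\n".join(f"  {i + 1}. {h}" for i, h in enumerate(m.nbest))
        return f"{m.name} reliability per segment: {tags}\n{m.name} N-best:\n{hyps}"

    return (
        "Correct the transcript using both modality streams. In each segment, "
        "prefer the modality tagged HIGH.\n\n"
        f"{fmt(asr)}\n\n{fmt(vsr)}\n\nCorrected transcript:"
    )


if __name__ == "__main__":
    asr = ModalityHypotheses(
        "ASR",
        ["the cat sad on the mat", "the cat sat on the mat"],
        [0.2, 0.9],  # e.g., acoustic noise corrupts the first segment
    )
    vsr = ModalityHypotheses(
        "VSR",
        ["the cat sat on the map", "the cap sat on the mat"],
        [0.8, 0.4],  # e.g., the lip stream is clear early, occluded later
    )
    print(build_dual_prompt(asr, vsr))
```

The printed prompt would then be passed to an LLM of choice; the actual prompt format, reliability estimation, and model interface used by DualHyp and RelPrompt are defined in the released repository.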
Key Contributions
Introduces DualHyp, a novel generative error correction framework for AVSR that leverages an LLM to compose N-best hypotheses from separate ASR and VSR models. It also proposes RelPrompt, a noise-aware guidance mechanism that dynamically directs the LLM's focus based on modality reliability, achieving significant error rate gains over single-stream approaches.
Business Value
Significantly improves the accuracy and robustness of speech recognition systems, leading to better user experiences in voice assistants, transcription services, and communication tools, especially in challenging acoustic environments.