Abstract
Audio-visual speech enhancement (AVSE) uses visual auxiliary information to extract a target speaker's speech from mixed audio. Real-world scenarios often involve complex acoustic environments with various interfering sounds and reverberation. Most previous methods struggle under such conditions, yielding extracted speech of poor perceptual quality. In this paper, we propose an effective AVSE system that performs well in complex acoustic environments. Specifically, we design a "separation before dereverberation" pipeline that can be extended to other AVSE networks. The 4th COGMHEAR Audio-Visual Speech Enhancement Challenge (AVSEC) aims to explore new approaches to speech processing in complex multimodal environments. We validated our system in AVSEC-4: it achieved excellent results on the three objective metrics of the competition leaderboard and ultimately secured first place in the human subjective listening test.
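The abstract does not describe the network internals, only the two-stage ordering. As a minimal sketch of the general idea, assuming a mask-based audio-visual separator followed by a spectral-mapping dereverberation module (all module names, feature shapes, and dimensions below are hypothetical, not the authors' architecture), the pipeline could be wired as follows:

```python
# Minimal sketch of a "separation before dereverberation" pipeline.
# SeparationStage, DereverbStage, and all dimensions are illustrative
# assumptions; the paper's actual architecture is not specified here.
import torch
import torch.nn as nn


class SeparationStage(nn.Module):
    """Stage 1: extract the target speaker from the mixture,
    conditioned on frame-aligned visual features of the speaker."""

    def __init__(self, audio_dim=257, visual_dim=512, hidden=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.visual_proj = nn.Linear(visual_dim, hidden)
        self.fusion = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.mask = nn.Linear(hidden, audio_dim)  # per-bin magnitude mask

    def forward(self, mix_spec, visual_emb):
        # mix_spec: (B, T, F) magnitude spectrogram of the noisy mixture
        # visual_emb: (B, T, visual_dim) visual embedding sequence
        a = self.audio_proj(mix_spec)
        v = self.visual_proj(visual_emb)
        h, _ = self.fusion(torch.cat([a, v], dim=-1))
        m = torch.sigmoid(self.mask(h))
        return m * mix_spec  # separated but still reverberant speech


class DereverbStage(nn.Module):
    """Stage 2: map the separated speech to its dry (dereverberated)
    spectrogram; Softplus keeps magnitudes non-negative."""

    def __init__(self, audio_dim=257, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, audio_dim), nn.Softplus(),
        )

    def forward(self, sep_spec):
        return self.net(sep_spec)


class SeparateThenDereverb(nn.Module):
    """Chain the stages in the order the paper names: separate first,
    then dereverberate the single-speaker output."""

    def __init__(self):
        super().__init__()
        self.separate = SeparationStage()
        self.dereverb = DereverbStage()

    def forward(self, mix_spec, visual_emb):
        return self.dereverb(self.separate(mix_spec, visual_emb))


if __name__ == "__main__":
    B, T = 2, 100
    model = SeparateThenDereverb()
    out = model(torch.rand(B, T, 257), torch.rand(B, T, 512))
    print(out.shape)  # torch.Size([2, 100, 257])
```

One plausible rationale for this ordering is that separating first lets the dereverberation stage operate on single-speaker speech rather than a multi-source mixture, and the same staging can be bolted onto other AVSE backbones, as the abstract notes.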
Authors (7)
Jiarong Du
Zhan Jin
Peijun Yang
Juan Liu
Zhuo Li
Xin Liu
+1 more
Submitted
October 29, 2025
Key Contributions
This paper proposes an effective Audio-Visual Speech Enhancement (AVSE) system that excels in complex acoustic environments by employing a novel 'separation before dereverberation' pipeline. This approach, validated in the AVSEC-4 challenge, significantly improves speech quality and intelligibility, achieving first place in human subjective evaluations.
Business Value
Improved speech clarity in noisy environments is crucial for applications like voice assistants, teleconferencing, and hearing aids, enhancing user experience and accessibility.