📄 Abstract
Audio Descriptions (ADs) convey essential on-screen information, allowing
visually impaired audiences to follow videos. To be effective, ADs must form a
coherent sequence that helps listeners to visualise the unfolding scene, rather
than describing isolated moments. However, most automatic methods generate each
AD independently, often resulting in repetitive, incoherent descriptions. To
address this, we propose a training-free method, CoherentAD, that first
generates multiple candidate descriptions for each AD time interval, and then
performs auto-regressive selection across the sequence to form a coherent and
informative narrative. To evaluate AD sequences holistically, we introduce a
sequence-level metric, StoryRecall, which measures how well the predicted ADs
convey the ground truth narrative, alongside repetition metrics that capture
the redundancy across consecutive AD outputs. Our method produces coherent AD
sequences with enhanced narrative understanding, outperforming prior approaches
that rely on independent generations.
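
The two-stage idea in the abstract (several candidates per AD interval, then auto-regressive selection across the sequence) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how candidates are scored, so `novelty_score` is a hypothetical toy scorer that simply prefers candidates whose words have not already appeared in earlier selections, and the names `word_set` and `select_sequence` are likewise illustrative. The sketch assumes every interval has at least one candidate.

```python
from typing import Callable, List, Sequence, Set


def word_set(text: str) -> Set[str]:
    """Lower-cased words of `text`, with surrounding punctuation stripped."""
    return {w.strip(".,!?;:\"'").lower() for w in text.split()} - {""}


def novelty_score(candidate: str, history: Sequence[str]) -> float:
    """Hypothetical scorer: fraction of candidate words not yet used in history."""
    words = word_set(candidate)
    if not words:
        return 0.0
    seen: Set[str] = set()
    for previous in history:
        seen |= word_set(previous)
    return len(words - seen) / len(words)


def select_sequence(
    candidates_per_interval: Sequence[Sequence[str]],
    score: Callable[[str, Sequence[str]], float] = novelty_score,
) -> List[str]:
    """Pick one candidate per interval, conditioning on all earlier picks."""
    selected: List[str] = []
    for candidates in candidates_per_interval:
        # Auto-regressive step: the score depends on what was already selected.
        best = max(candidates, key=lambda c: score(c, selected))
        selected.append(best)
    return selected


candidates = [
    ["A man enters the kitchen.", "A man walks in."],
    ["A man enters the kitchen.", "He opens the fridge."],
]
sequence = select_sequence(candidates)
```

In this toy run the second interval avoids repeating the first description, because every word of the repeated candidate has already been used. Any scorer that takes a candidate and the selection history could be passed in place of `novelty_score`.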
Authors (8)
Eshika Khandelwal
Junyu Xie
Tengda Han
Max Bain
Arsha Nagrani
Andrew Zisserman
+2 more
Submitted
October 29, 2025
Key Contributions
This paper proposes CoherentAD, a training-free method for generating coherent sequences of audio descriptions (ADs) by performing auto-regressive selection across candidate descriptions. It also introduces StoryRecall, a novel sequence-level metric to holistically evaluate AD sequences, addressing the incoherence and repetition issues of existing methods.
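
To make the notion of redundancy across consecutive AD outputs concrete, here is a minimal sketch of one possible repetition measure. It is an assumption for illustration only: the summary does not define the paper's repetition metrics, so this uses the mean Jaccard word overlap between each AD and the one before it, and the function names are hypothetical.

```python
from typing import Sequence, Set


def word_set(text: str) -> Set[str]:
    """Lower-cased words of `text`, with surrounding punctuation stripped."""
    return {w.strip(".,!?;:\"'").lower() for w in text.split()} - {""}


def consecutive_repetition(ads: Sequence[str]) -> float:
    """Mean Jaccard word overlap between each AD and its predecessor.

    Returns 0.0 for sequences with fewer than two ADs. This is an
    illustrative stand-in, not the metric defined in the paper.
    """
    if len(ads) < 2:
        return 0.0
    overlaps = []
    for previous, current in zip(ads, ads[1:]):
        a, b = word_set(previous), word_set(current)
        union = a | b
        overlaps.append(len(a & b) / len(union) if union else 0.0)
    return sum(overlaps) / len(overlaps)


repeated = consecutive_repetition(["He sits down.", "He sits down."])
distinct = consecutive_repetition(["He sits.", "She leaves."])
```

Under this stand-in measure, a lower value means consecutive descriptions share fewer words, so identical neighbours score highest and fully distinct neighbours score lowest.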
Business Value
Enhances video accessibility for visually impaired audiences by providing more understandable and engaging audio descriptions, which improves the user experience and supports compliance with accessibility standards.