More than a Moment: Towards Coherent Sequences of Audio Descriptions

Abstract

Audio Descriptions (ADs) convey essential on-screen information, allowing visually impaired audiences to follow videos. To be effective, ADs must form a coherent sequence that helps listeners visualise the unfolding scene, rather than describing isolated moments. However, most automatic methods generate each AD independently, often resulting in repetitive, incoherent descriptions. To address this, we propose a training-free method, CoherentAD, that first generates multiple candidate descriptions for each AD time interval, and then performs auto-regressive selection across the sequence to form a coherent and informative narrative. To evaluate AD sequences holistically, we introduce a sequence-level metric, StoryRecall, which measures how well the predicted ADs convey the ground truth narrative, alongside repetition metrics that capture the redundancy across consecutive AD outputs. Our method produces coherent AD sequences with enhanced narrative understanding, outperforming prior approaches that generate each AD independently.
Authors (8)
Eshika Khandelwal
Junyu Xie
Tengda Han
Max Bain
Arsha Nagrani
Andrew Zisserman
+2 more
Submitted
October 29, 2025
arXiv Category
cs.CV

Key Contributions

This paper proposes CoherentAD, a training-free method for generating coherent sequences of audio descriptions (ADs) by performing auto-regressive selection across candidate descriptions. It also introduces StoryRecall, a novel sequence-level metric to holistically evaluate AD sequences, addressing the incoherence and repetition issues of existing methods.
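The selection step can be illustrated with a minimal sketch. This is not the paper's implementation: the summary does not say how candidates are scored, so the word-overlap `novelty` score, the `window` parameter, and all function names here are hypothetical stand-ins. The sketch only shows the general shape of auto-regressive selection, where each interval's choice is conditioned on the descriptions already chosen.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased word set of a description."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two word sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def novelty(candidate: str, history: List[str]) -> float:
    """Hypothetical score: 1 minus the highest overlap with any earlier AD.

    Assumption: a stand-in for whatever selector the paper uses.
    """
    if not history:
        return 1.0
    cand = tokens(candidate)
    return 1.0 - max(jaccard(cand, tokens(h)) for h in history)


def select_sequence(candidates_per_interval: List[List[str]],
                    window: int = 3) -> List[str]:
    """Pick one candidate per interval, conditioned on earlier picks.

    Each interval must have at least one candidate; ties go to the
    first candidate listed.
    """
    selected: List[str] = []
    for candidates in candidates_per_interval:
        history = selected[-window:]  # only the most recent picks
        best = max(candidates, key=lambda c: novelty(c, history))
        selected.append(best)
    return selected


if __name__ == "__main__":
    candidates = [
        ["A man walks into a bar."],
        ["A man walks into a bar.",
         "He orders a drink from the bartender."],
        ["He orders a drink from the bartender.",
         "A woman watches him from the corner."],
    ]
    for ad in select_sequence(candidates):
        print(ad)
```

In this toy input, a candidate that repeats an earlier description scores zero novelty, so the selector prefers the alternative that adds new information. A real selector would presumably also weigh how faithful and informative each candidate is, which this overlap heuristic ignores.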

Business Value

More coherent, less repetitive audio descriptions could make videos easier for visually impaired audiences to follow, which may improve the user experience and help media providers meet accessibility requirements.