More than a Moment: Towards Coherent Sequences of Audio Descriptions

Abstract

Audio Descriptions (ADs) convey essential on-screen information, allowing visually impaired audiences to follow videos. To be effective, ADs must form a coherent sequence that helps listeners to visualise the unfolding scene, rather than describing isolated moments. However, most automatic methods generate each AD independently, often resulting in repetitive, incoherent descriptions. To address this, we propose a training-free method, CoherentAD, that first generates multiple candidate descriptions for each AD time interval, and then performs auto-regressive selection across the sequence to form a coherent and informative narrative. To evaluate AD sequences holistically, we introduce a sequence-level metric, StoryRecall, which measures how well the predicted ADs convey the ground truth narrative, alongside repetition metrics that capture the redundancy across consecutive AD outputs. Our method produces coherent AD sequences with enhanced narrative understanding, outperforming prior approaches that rely on independent generations.
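
The two-stage idea in the abstract (several candidates per time interval, then a left-to-right selection conditioned on what was already chosen) can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify how candidates are scored, so the word-overlap criterion and the names `overlap` and `select_sequence` are hypothetical stand-ins chosen only to keep the example self-contained.

```python
from typing import List


def _words(text: str) -> set:
    """Lowercased word set with surrounding punctuation stripped."""
    return {w.strip(".,;:!?\"'").lower() for w in text.split()} - {""}


def overlap(candidate: str, history: List[str]) -> float:
    """Fraction of the candidate's words that already appear in earlier selections.

    Hypothetical scoring rule (an assumption for illustration), not the
    selection criterion used by CoherentAD.
    """
    cand = _words(candidate)
    if not cand:
        return 1.0  # treat an empty candidate as fully redundant
    seen = set().union(*map(_words, history))
    return len(cand & seen) / len(cand)


def select_sequence(candidates_per_interval: List[List[str]]) -> List[str]:
    """Greedy auto-regressive selection: one candidate per interval, in order.

    Each choice depends on the descriptions already selected for earlier
    intervals. Assumes every interval has at least one candidate.
    """
    selected: List[str] = []
    for candidates in candidates_per_interval:
        # Pick the candidate that repeats the least of what was already said;
        # ties go to the first candidate in the list.
        best = min(candidates, key=lambda c: overlap(c, selected))
        selected.append(best)
    return selected


candidates = [
    ["A man walks into a kitchen.", "A man stands."],
    ["A man walks into a kitchen.", "He opens the fridge and takes out milk."],
]
sequence = select_sequence(candidates)
```

In this toy input the second interval's first candidate repeats the description already chosen for the first interval, so the selection moves on to the candidate that adds new information. A real system would presumably replace the overlap score with a model-based judgement of coherence and informativeness.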

Authors (8)

Eshika Khandelwal

Junyu Xie

Tengda Han

Max Bain

Arsha Nagrani

Andrew Zisserman

+2 more

Submitted

October 29, 2025

arXiv Category

cs.CV

Key Contributions

This paper proposes CoherentAD, a training-free method for generating coherent sequences of audio descriptions (ADs) by performing auto-regressive selection across candidate descriptions. It also introduces StoryRecall, a novel sequence-level metric to holistically evaluate AD sequences, addressing the incoherence and repetition issues of existing methods.

Business Value

Enhances video accessibility for visually impaired audiences by providing more understandable and engaging audio descriptions, improving user experience and compliance with accessibility standards.

Paper Metadata

Innovation Type

Algorithmic Improvement & Evaluation Metric

Deployment Feasibility

High, as it's a training-free method that can be applied to existing video content or generation pipelines.

Limitations Addressed

Automatic methods often generate repetitive and incoherent ADs by treating each time interval independently; lack of holistic evaluation metrics for AD sequences.

Performance Gains

Produces coherent AD sequences with enhanced narrative understanding. Outperforms existing methods (implied).

Technical Tags

audio descriptions (ADs), video understanding, sequence generation, coherence, narrative understanding, auto-regressive selection, training-free method, sequence-level metric, StoryRecall, repetition metrics

Research Topics

Multimodal AI, Natural Language Generation, Video Understanding, Accessibility, Machine Learning

Methods & Architectures

Training-free generation, Auto-regressive selection, Sequence generation, Candidate generation, Holistic evaluation, Sequence-to-sequence models, Large Language Models (LLMs)

Applications & Tasks

Video Accessibility, Media Production, Assistive Technology, Sequence Generation, Natural Language Generation, Content Summarization, Coherence Modeling, Generating coherent audio descriptions, Improving narrative understanding in videos, Reducing repetition in generated descriptions

Datasets & Benchmarks

Benchmarks

StoryRecall metric

StoryRecall, Repetition metrics
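
The repetition metrics listed above are described in the abstract only as capturing redundancy across consecutive AD outputs. As a rough sketch of what such a measure could look like, the snippet below averages the Jaccard word overlap between each AD and the one before it. The function names and the choice of Jaccard similarity are assumptions made for illustration; the paper's actual definitions may differ.

```python
from typing import List


def word_set(text: str) -> set:
    """Lowercased word set with surrounding punctuation stripped."""
    return {w.strip(".,;:!?\"'").lower() for w in text.split()} - {""}


def consecutive_repetition(ads: List[str]) -> float:
    """Mean Jaccard similarity between each AD and its predecessor.

    Hypothetical stand-in for a repetition metric: 0.0 means no shared
    words between neighbours, 1.0 means every neighbouring pair uses
    exactly the same words.
    """
    if len(ads) < 2:
        return 0.0  # nothing to compare against
    scores = []
    for prev, curr in zip(ads, ads[1:]):
        a, b = word_set(prev), word_set(curr)
        union = a | b
        scores.append(len(a & b) / len(union) if union else 0.0)
    return sum(scores) / len(scores)


repeated = consecutive_repetition(["He sits.", "He sits."])
varied = consecutive_repetition(["He sits.", "She runs away."])
```

Lower values indicate less redundancy between neighbouring descriptions. A word-level measure like this ignores paraphrase, so it would only be a coarse proxy for the redundancy a listener actually perceives.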

Related Fields

Computer Vision, Speech Processing, Human-Computer Interaction, Accessibility Research

Keywords

audio descriptions, ADs, video understanding, sequence generation, coherence, narrative, accessibility, visually impaired, LLMs, auto-regressive, training-free, StoryRecall, evaluation metric, multimodal AI, natural language generation

Academic Context

#Multimodal AI, #Natural Language Generation, #Video Understanding, #Accessibility, #Machine Learning

Commercial Potential

Potential Products

Automated audio description generation tools for video platforms, Enhanced video accessibility features, Tools for content creators to improve AD quality

Target Industries

Media and Entertainment, Streaming Services, EdTech, Assistive Technology

Use Case Examples

Automatically generating audio descriptions for movies and TV shows. Improving the quality of live-streamed event descriptions. Creating more engaging video content for visually impaired users.

Competitive Edge

Offers a training-free approach that focuses on sequence coherence and uses a novel holistic evaluation metric, differentiating it from methods that generate descriptions independently.

Market Opportunity

Growing demand for accessible digital content.

Revenue Models

Licensing of the technology to video platforms; offering AD generation as a service.

Resource Requirements

Compute Needs

Low to Moderate (training-free implies inference-focused)

Data Requirements

Video datasets with corresponding audio descriptions.

Deployment Constraints

Effectiveness depends on the quality and coverage of the underlying video content and the candidate generation process.

Scalability

Training-free nature makes it highly scalable for inference.

Regulatory Considerations

Compliance with accessibility standards (e.g., WCAG).

Production Readiness

Maturity Level

Research/Development

Time to Market

1-2 years

Patent Potential

Moderate (for the CoherentAD method and StoryRecall metric)
