📄 Abstract
We introduce the task of grounded article generation with the goal of
creating a Wikipedia-style article from multiple diverse videos about
real-world events -- from natural disasters to political elections -- where all
the information in the article is supported by video evidence. Videos are
intuitive sources for retrieval-augmented generation (RAG), but most
contemporary RAG workflows focus heavily on text while existing methods for
video-based summarization focus on low-level scene understanding rather than
high-level event semantics. To close this gap, we introduce WikiVideo, a
benchmark consisting of expert-written articles and densely annotated videos
that provide evidence for articles' claims, facilitating the integration of
video into RAG pipelines and enabling the creation of in-depth content that is
grounded in multimodal sources. We further propose Collaborative Article
Generation (CAG), a novel interactive method for article creation from multiple
videos. CAG leverages an iterative interaction between an r1-style reasoning
model and a VideoLLM to draw higher-level inferences about the target event
than is possible with VideoLLMs alone, which fixate on low-level visual
features. We benchmark state-of-the-art VideoLLMs and CAG in both oracle
retrieval and RAG settings and find that CAG consistently outperforms
alternative methods, while suggesting intriguing avenues for future work.
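
The iterative interaction described for CAG could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three callables (`describe_video`, `reason`, `write_article`), the prompt wording, the stopping rule (an empty follow-up prompt), and the `max_rounds` cap are all hypothetical stand-ins for the VideoLLM, the reasoning model, and the final article-writing step.

```python
from typing import Callable, List


def collaborative_article_generation(
    videos: List[str],
    topic: str,
    describe_video: Callable[[str, str], str],    # hypothetical VideoLLM: (video, prompt) -> description
    reason: Callable[[str, List[str]], str],      # hypothetical reasoning model: (topic, notes) -> follow-up prompt, "" when satisfied
    write_article: Callable[[str, List[str]], str],  # hypothetical writer: (topic, notes) -> article text
    max_rounds: int = 3,                          # assumed cap on query rounds per video
) -> str:
    """Alternate between a VideoLLM and a reasoning model, then write an article."""
    all_notes: List[str] = []
    for video in videos:
        prompt = f"Describe what this video shows about: {topic}"
        video_notes: List[str] = []
        for _ in range(max_rounds):
            # The VideoLLM answers the current prompt for this video.
            video_notes.append(describe_video(video, prompt))
            # The reasoning model reviews the notes so far and either
            # asks a follow-up question or returns "" to stop.
            prompt = reason(topic, video_notes)
            if not prompt:
                break
        all_notes.extend(video_notes)
    # The article is written only from notes gathered from the videos.
    return write_article(topic, all_notes)
```

Passing the models in as plain callables keeps the sketch independent of any particular model API; in practice each callable would wrap whichever VideoLLM or reasoning model is being evaluated.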
Authors (8)
Alexander Martin
Reno Kriz
William Gantt Walden
Kate Sanders
Hannah Recknor
Eugene Yang
+2 more
Key Contributions
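Introduces WikiVideo, a benchmark of expert-written articles and densely annotated videos for grounded article generation from multiple videos, and proposes Collaborative Article Generation (CAG), an interactive method in which an r1-style reasoning model and a VideoLLM iterate to draw higher-level inferences about the target event. The work addresses a gap in RAG, where most workflows focus on text and video summarization focuses on low-level scene understanding, by targeting high-level event semantics so that all information in a generated article is supported by video evidence.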
Business Value
Could help automate the creation of informative, evidence-grounded articles from video content, which may be valuable for news organizations, educational platforms, and content creators.