Abstract
The rapid growth of streaming video applications demands multimodal models
with enhanced capabilities for temporal dynamics understanding and complex
reasoning. However, current Video Question Answering (VideoQA) datasets suffer
from two critical limitations: 1) Static annotation mechanisms fail to capture
the evolving nature of answers in temporal video streams, and 2) The absence of
explicit reasoning process annotations restricts model interpretability and
logical deduction capabilities. To address these challenges, we introduce
StreamingCoT, the first dataset explicitly designed for temporally evolving
reasoning in streaming VideoQA and multimodal Chain-of-Thought (CoT) tasks. Our
framework first establishes a dynamic hierarchical annotation architecture that
generates per-second dense descriptions and constructs temporally-dependent
semantic segments through similarity fusion, paired with question-answer sets
constrained by temporal evolution patterns. We further propose an explicit
reasoning chain generation paradigm that extracts spatiotemporal objects via
keyframe semantic alignment, derives object state transition-based reasoning
paths using large language models, and ensures logical coherence through
human verification. This dataset establishes a foundation for advancing
research in streaming video understanding, complex temporal reasoning, and
multimodal inference. Our StreamingCoT and its construction toolkit can be
accessed at https://github.com/Fleeting-hyh/StreamingCoT.
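The segment construction step is only described at a high level above; the sketch below is a minimal illustration, assuming per-second captions have already been embedded with some sentence encoder and that consecutive seconds are fused while their cosine similarity to the running segment centroid stays above a threshold. The function name, threshold value, and toy embeddings are illustrative assumptions, not the released construction toolkit.

```python
import numpy as np

def fuse_segments(second_embeddings: np.ndarray, threshold: float = 0.85):
    """Group consecutive per-second caption embeddings into segments.

    A new segment starts whenever the cosine similarity between the
    current second's embedding and the running segment centroid drops
    below `threshold`. Returns a list of (start_sec, end_sec) tuples.
    (Illustrative sketch, not the authors' implementation.)
    """
    segments = []
    start, count = 0, 1
    centroid = second_embeddings[0].astype(float)
    for t in range(1, len(second_embeddings)):
        emb = second_embeddings[t].astype(float)
        sim = np.dot(centroid, emb) / (
            np.linalg.norm(centroid) * np.linalg.norm(emb) + 1e-8
        )
        if sim >= threshold:
            # Similar enough: fold this second into the current segment.
            centroid = (centroid * count + emb) / (count + 1)
            count += 1
        else:
            # Semantic shift: close the current segment and open a new one.
            segments.append((start, t - 1))
            start, centroid, count = t, emb, 1
    segments.append((start, len(second_embeddings) - 1))
    return segments

if __name__ == "__main__":
    # Six seconds of captions embedded into toy 3-d vectors.
    embs = np.array([
        [1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.95, 0.05, 0.0],  # "person picks up cup"
        [0.0, 1.0, 0.0], [0.1, 0.9, 0.0],                      # "person drinks"
        [0.0, 0.0, 1.0],                                        # "person leaves frame"
    ])
    print(fuse_segments(embs, threshold=0.8))  # [(0, 2), (3, 4), (5, 5)]
```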
Authors (8)
Yuhang Hu
Zhenyu Yang
Shihan Wang
Shengsheng Qian
Bin Wen
Fan Yang
+2 more
Submitted
October 29, 2025
Key Contributions
Introduces StreamingCoT, the first dataset for temporally evolving reasoning in streaming VideoQA and multimodal Chain-of-Thought (CoT) tasks. It addresses the limitations of static annotations with a dynamic hierarchical architecture that produces per-second dense descriptions and temporally-dependent semantic segments, and pairs them with explicit, human-verified reasoning chains to improve model interpretability and logical deduction. A hypothetical record sketch follows below.
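For intuition only, the sketch below shows what a single temporally evolving QA record with an explicit, state-transition-based reasoning chain might look like. All field names, values, and the overall structure are hypothetical assumptions for illustration, not the released StreamingCoT schema.

```python
# Hypothetical StreamingCoT-style record (field names are assumptions,
# not the released schema): the answer to the same question evolves as
# the stream progresses, and each answer carries an explicit reasoning
# chain grounded in object state transitions.
example_record = {
    "video_id": "demo_0001",
    "question": "What is the person doing with the cup?",
    "evolving_answers": [
        {
            "valid_span_sec": [0, 2],
            "answer": "Picking up the cup",
            "reasoning_chain": [
                "Keyframe at 1s shows a hand approaching the cup.",
                "The cup's state transitions from 'on table' to 'in hand'.",
                "Therefore the person is picking up the cup.",
            ],
        },
        {
            "valid_span_sec": [3, 4],
            "answer": "Drinking from the cup",
            "reasoning_chain": [
                "The cup's state transitions from 'in hand' to 'at mouth'.",
                "Tilting plus contact with the mouth indicates drinking.",
            ],
        },
    ],
}
```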
Business Value
Enables the development of more sophisticated AI systems that can understand and reason about dynamic video content in real-time, crucial for applications like automated video summarization, content analysis, and interactive media.