StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA

📄 Abstract

The rapid growth of streaming video applications demands multimodal models with enhanced capabilities for temporal dynamics understanding and complex reasoning. However, current Video Question Answering (VideoQA) datasets suffer from two critical limitations: 1) static annotation mechanisms fail to capture the evolving nature of answers in temporal video streams, and 2) the absence of explicit reasoning process annotations restricts model interpretability and logical deduction capabilities. To address these challenges, we introduce StreamingCoT, the first dataset explicitly designed for temporally evolving reasoning in streaming VideoQA and multimodal Chain-of-Thought (CoT) tasks. Our framework first establishes a dynamic hierarchical annotation architecture that generates per-second dense descriptions and constructs temporally-dependent semantic segments through similarity fusion, paired with question-answer sets constrained by temporal evolution patterns. We further propose an explicit reasoning chain generation paradigm that extracts spatiotemporal objects via keyframe semantic alignment, derives object state transition-based reasoning paths using large language models, and ensures logical coherence through human-verified validation. This dataset establishes a foundation for advancing research in streaming video understanding, complex temporal reasoning, and multimodal inference. Our StreamingCoT and its construction toolkit can be accessed at https://github.com/Fleeting-hyh/StreamingCoT.
Authors (8)
Yuhang Hu
Zhenyu Yang
Shihan Wang
Shengsheng Qian
Bin Wen
Fan Yang
+2 more
Submitted
October 29, 2025
arXiv Category
cs.CV

Key Contributions

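Introduces StreamingCoT, the first dataset explicitly designed for temporally evolving reasoning in streaming VideoQA and multimodal Chain-of-Thought (CoT) tasks. To address the limitations of static annotations, it uses a dynamic hierarchical annotation architecture: per-second dense descriptions are fused by similarity into temporally-dependent semantic segments and paired with question-answer sets whose answers change as the video progresses. To address the lack of reasoning annotations, it adds explicit reasoning chains built from spatiotemporal objects and their state transitions, generated with large language models and verified by humans, which supports model interpretability and logical deduction.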
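
The similarity-fusion step can be illustrated with a minimal sketch. It assumes that consecutive per-second descriptions are merged into one segment while each new description stays similar to the previous one. The `Segment` class, the `fuse_segments` function, the Jaccard token-overlap measure, and the `0.4` threshold are all hypothetical choices made for illustration. The abstract does not specify the similarity measure or the fusion rule, so this is not the authors' toolkit.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: int           # first second covered (inclusive)
    end: int             # last second covered (inclusive)
    captions: list[str]  # per-second descriptions merged into this segment


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; an assumed stand-in measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def fuse_segments(captions: list[str], threshold: float = 0.4) -> list[Segment]:
    """Merge consecutive per-second captions while neighbours stay similar."""
    segments: list[Segment] = []
    for t, caption in enumerate(captions):
        # Compare against the most recent caption of the open segment.
        if segments and jaccard(segments[-1].captions[-1], caption) >= threshold:
            segments[-1].end = t
            segments[-1].captions.append(caption)
        else:
            # Similarity dropped below the threshold: start a new segment.
            segments.append(Segment(start=t, end=t, captions=[caption]))
    return segments


captions = [
    "a man holds a red cup",
    "a man holds a red cup up",
    "a man drinks from a red cup",
    "a dog runs across the yard",
    "a dog runs across the green yard",
]
segments = fuse_segments(captions)
for seg in segments:
    print(seg.start, seg.end, seg.captions[0])
```

On these toy captions the first three seconds fall into one segment and the last two into another. A real pipeline would more plausibly compare sentence embeddings than raw tokens, but the merge loop would keep the same shape.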

Business Value

Supports the development of AI systems that can understand and reason about dynamic video content in real time, which is crucial for applications such as automated video summarization, content analysis, and interactive media.