
Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence


📄 Abstract

Most video reasoning models only generate textual reasoning traces without indicating when and where key evidence appears. Recent models such as OpenAI-o3 have sparked wide interest in evidence-centered reasoning for images, yet extending this ability to videos is more challenging, as it requires joint temporal tracking and spatial localization across dynamic scenes. We introduce Open-o3 Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning, and carefully collect training data and design training strategies to address these challenges. The model highlights key timestamps, objects, and bounding boxes alongside its answers, allowing reasoning to be grounded in concrete visual observations. To enable this functionality, we first curate two high-quality datasets, STGR-CoT-30k for SFT and STGR-RL-36k for RL, with carefully constructed temporal and spatial annotations, since most existing datasets offer either temporal spans for videos or spatial boxes on images, lacking unified spatio-temporal supervision and reasoning traces. Then, we adopt a cold-start reinforcement learning strategy with multiple specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. On the V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance, raising mAM by 14.4% and mLGM by 24.2% over the Qwen2.5-VL baseline. Consistent improvements are also observed on a broad range of video understanding benchmarks, including VideoMME, WorldSense, VideoMMMU, and TVGBench. Beyond accuracy, the reasoning traces produced by Open-o3 Video also provide valuable signals for test-time scaling, enabling confidence-aware verification and improving answer reliability.
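The abstract describes rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision, but this summary does not give their exact form. The sketch below is only an illustration of how three such terms could be combined. Every name in it (`box_iou`, `temporal_score`, `combined_reward`), the Gaussian falloff with `sigma`, the exact-match answer check, and the weights are assumptions, not the paper's definitions.

```python
import math

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical convention


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def temporal_score(t_pred: float, t_gt: float, sigma: float = 1.0) -> float:
    """Assumed temporal term: 1.0 at the reference timestamp, decaying with distance."""
    return math.exp(-((t_pred - t_gt) ** 2) / (2.0 * sigma ** 2))


def combined_reward(
    answer_pred: str,
    answer_gt: str,
    t_pred: float,
    t_gt: float,
    box_pred: Box,
    box_gt: Box,
    weights: tuple[float, float, float] = (1.0, 0.5, 0.5),  # illustrative weights
    sigma: float = 1.0,
) -> float:
    """Weighted sum of answer, temporal, and spatial terms (illustrative only)."""
    w_acc, w_time, w_space = weights
    # Assumed accuracy term: case-insensitive exact match on the answer string.
    acc = 1.0 if answer_pred.strip().lower() == answer_gt.strip().lower() else 0.0
    return (
        w_acc * acc
        + w_time * temporal_score(t_pred, t_gt, sigma)
        + w_space * box_iou(box_pred, box_gt)
    )
```

With these assumed weights, a correct answer whose timestamp and box both match the reference scores 2.0, and each term degrades independently as its prediction drifts from the reference.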

Authors (11)

Jiahao Meng

Xiangtai Li

Haochen Wang

Yue Tan

Tao Zhang

Lingdong Kong

+5 more

Submitted

October 23, 2025

arXiv Category

cs.CV

Key Contributions

Introduces Open-o3 Video, a framework for grounded video reasoning that integrates explicit spatio-temporal evidence (timestamps, objects, bounding boxes) into reasoning traces. It addresses challenges in temporal tracking and spatial localization and includes curated datasets for supervised fine-tuning and reinforcement learning.
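To make the idea of evidence attached to an answer concrete, here is a minimal sketch of a container for a timestamp, an object label, and a bounding box. The class names (`Evidence`, `GroundedAnswer`), field names, and JSON layout are hypothetical; this summary does not specify the model's actual output format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Evidence:
    """One piece of spatio-temporal evidence (hypothetical schema)."""

    timestamp_s: float  # when the evidence appears, in seconds
    label: str  # which object it refers to
    box: tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

    def __post_init__(self) -> None:
        x1, y1, x2, y2 = self.box
        if self.timestamp_s < 0:
            raise ValueError("timestamp must be non-negative")
        if not (x1 < x2 and y1 < y2):
            raise ValueError("box must satisfy x1 < x2 and y1 < y2")


@dataclass
class GroundedAnswer:
    """An answer together with the evidence it cites (hypothetical schema)."""

    answer: str
    evidence: list[Evidence]

    def to_json(self) -> str:
        # asdict recurses into nested dataclasses; tuples become JSON arrays.
        return json.dumps(asdict(self))


example = GroundedAnswer(
    answer="The person picks up a red cup.",
    evidence=[Evidence(timestamp_s=3.2, label="cup", box=(10, 20, 50, 80))],
)
print(example.to_json())
```

Validating the box and timestamp at construction time keeps malformed evidence from reaching any downstream scoring or visualization step.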

Business Value

Enables more transparent and verifiable AI systems for video analysis, improving trust and utility in applications like content moderation, security, and autonomous systems.
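The abstract notes that the reasoning traces can support confidence-aware verification at test time. One simple way to picture this is confidence-weighted voting over several sampled answers, sketched below. The function name `confidence_weighted_vote` is hypothetical, and how each confidence score is obtained (for example, from a verifier that checks whether the cited evidence supports the answer) is an assumption that this summary does not specify.

```python
from collections import defaultdict


def confidence_weighted_vote(candidates: list[tuple[str, float]]) -> str:
    """Return the answer with the largest summed confidence.

    `candidates` holds (answer, confidence) pairs from several sampled
    responses. Negative confidences are clamped to zero. Ties go to the
    answer that appeared first.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    totals: dict[str, float] = defaultdict(float)
    for answer, confidence in candidates:
        totals[answer] += max(0.0, confidence)
    return max(totals, key=totals.get)


samples = [("A", 0.2), ("B", 0.9), ("A", 0.3)]
print(confidence_weighted_vote(samples))
```

Unlike plain majority voting, a single well-supported answer can outweigh several poorly supported ones, which is the behavior a confidence-aware scheme is meant to allow.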

Paper Metadata

Innovation Type

Framework/Methodological

Deployment Feasibility

Moderate; requires significant data annotation and specialized model training.

Limitations Addressed

Most video reasoning models generate textual traces without indicating specific spatio-temporal evidence, hindering grounding.

Technical Tags

Video Reasoning, Spatio-temporal Evidence, Grounded Reasoning, Open-o3 Video, LLMs, Temporal Tracking, Spatial Localization, Bounding Boxes, Timestamp Annotation, Multimodal AI

Research Topics

Video Understanding, Multimodal Reasoning, Explainable AI, Grounded Language Generation, Computer Vision

Methods & Architectures

Non-agent Framework, Explicit Spatio-Temporal Evidence Integration, Data Curation and Collection, Training Strategy Design, Fine-tuning, Reinforcement Learning (for STGR-RL-36k), Large Language Models (LLMs), Multimodal Models

Applications & Tasks

Applications: Video Analysis; Content Understanding; Surveillance; Robotics

Challenges addressed: Lack of explicit spatio-temporal evidence in video reasoning traces; Difficulty in joint temporal tracking and spatial localization; Grounding reasoning in visual observations

Tasks: Grounded video reasoning; Generating reasoning traces with explicit evidence; Highlighting key timestamps, objects, and bounding boxes; Video question answering

Datasets & Benchmarks

Datasets

STGR-CoT-30k, STGR-RL-36k

Related Fields

Computer Vision, Natural Language Processing, Video Processing, Explainable AI, Deep Learning

Keywords

Video Reasoning, Grounded AI, Spatio-temporal, Open-o3 Video, Multimodal LLM, Video Analysis, Explainable AI, Object Detection, Temporal Localization, Computer Vision

Academic Context

#Video Understanding, #Multimodal Reasoning, #Explainable AI, #Grounded Language Generation, #Computer Vision

Companies & Organizations

Companies Mentioned

OpenAI

Commercial Potential

Potential Products

Video analysis platforms with explainable reasoning; Tools for automated video summarization and content tagging; AI systems for autonomous driving and robotics

Target Industries

Media, Security, Automotive, Robotics, Technology

Use Case Examples

Providing detailed explanations for autonomous vehicle decisions based on video input. Automated content moderation for video platforms, highlighting specific moments and objects. Generating summaries of surveillance footage with precise timestamps and object identification.

Competitive Edge

Advances video reasoning by explicitly grounding it in spatio-temporal evidence, offering greater transparency and interpretability than models that only produce textual outputs.

Market Opportunity

Significant growth in video analytics and AI-powered content understanding.

Revenue Models

SaaS solutions, API access for video analysis.

Resource Requirements

Compute Needs

High for training, moderate for inference.

Data Requirements

Requires curated datasets with detailed spatio-temporal annotations (STGR-CoT-30k, STGR-RL-36k).

Deployment Constraints

Complexity of spatio-temporal annotation and model training.

Scalability

Scalability depends on the efficiency of the video processing and reasoning components.

Regulatory Considerations

Privacy concerns with video analysis, bias in datasets.

Production Readiness

Maturity Level

Research/Development

Time to Market

2-3 years

Licensing

Not specified

Patent Potential

Moderate
