
EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT

📄 Abstract

Egocentric video reasoning centers on an unobservable agent behind the camera who dynamically shapes the environment, requiring inference of hidden intentions and recognition of fine-grained interactions. This core challenge limits current multimodal large language models (MLLMs), which excel at visible-event reasoning but lack embodied, first-person understanding. To bridge this gap, we introduce EgoThinker, a novel framework that endows MLLMs with robust egocentric reasoning capabilities through spatio-temporal chain-of-thought (CoT) supervision and a two-stage learning curriculum. First, we introduce EgoRe-5M, a large-scale egocentric QA dataset constructed from 13M diverse egocentric video clips. This dataset features multi-minute segments annotated with detailed CoT rationales and dense hand-object grounding. Second, we employ supervised fine-tuning (SFT) on EgoRe-5M to instill reasoning skills, followed by reinforcement fine-tuning (RFT) to further enhance spatio-temporal localization. Experimental results show that EgoThinker outperforms existing methods across multiple egocentric benchmarks, while achieving substantial improvements in fine-grained spatio-temporal localization tasks. Full code and data are released at https://github.com/InternRobotics/EgoThinker.
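The abstract states that RFT sharpens spatio-temporal localization but does not specify the reward signal. A common choice for temporal localization rewards in reinforcement fine-tuning is temporal IoU between the predicted and ground-truth segments; the sketch below illustrates that idea only, and the function names and the dense-vs-binary reward choice are assumptions, not the paper's actual implementation.

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Temporal IoU between two [start, end] intervals (in seconds)."""
    ps, pe = pred
    gs, ge = gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))  # overlap length
    union = max(pe, ge) - min(ps, gs)            # span of both intervals
    return inter / union if union > 0 else 0.0

def localization_reward(pred, gt, threshold: float = 0.5) -> float:
    """Hypothetical RFT reward: dense temporal IoU.
    A binary variant would be float(temporal_iou(pred, gt) >= threshold)."""
    return temporal_iou(pred, gt)
```

With a dense reward the policy receives a gradient signal even for partially overlapping predictions, which is typically easier to optimize than a hard 0/1 threshold.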
Authors (8)
Baoqi Pei
Yifei Huang
Jilan Xu
Yuping He
Guo Chen
Fei Wu
+2 more
Submitted
October 27, 2025
arXiv Category
cs.CV
arXiv PDF

Key Contributions

Introduces EgoThinker, a framework enabling MLLMs to perform egocentric video reasoning by leveraging spatio-temporal CoT supervision and a two-stage learning curriculum (SFT+RFT). It also introduces EgoRe-5M, a large-scale egocentric QA dataset with detailed rationales and grounding, bridging the gap in embodied understanding.

Business Value

Enhances the ability of AI systems, particularly robots and virtual agents, to understand and interact with the world from a first-person perspective, crucial for applications in robotics, VR/AR, and human-AI collaboration.