📄 Abstract
Egocentric video reasoning centers on an unobservable agent behind the camera who dynamically shapes the environment, requiring inference of hidden intentions and recognition of fine-grained interactions. This core challenge limits current multimodal large language models (MLLMs), which excel at visible-event reasoning but lack embodied, first-person understanding. To bridge this gap, we introduce EgoThinker, a novel framework that endows MLLMs with robust egocentric reasoning capabilities through spatio-temporal chain-of-thought (CoT) supervision and a two-stage learning curriculum. First, we introduce EgoRe-5M, a large-scale egocentric QA dataset constructed from 13M diverse egocentric video clips. This dataset features multi-minute segments annotated with detailed CoT rationales and dense hand-object grounding. Second, we employ supervised fine-tuning (SFT) on EgoRe-5M to instill reasoning skills, followed by reinforcement fine-tuning (RFT) to further enhance spatio-temporal localization. Experimental results show
that EgoThinker outperforms existing methods across multiple egocentric
benchmarks, while achieving substantial improvements in fine-grained
spatio-temporal localization tasks. Full code and data are released at
https://github.com/InternRobotics/EgoThinker.
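
The two-stage curriculum (SFT on CoT-annotated QA, then RFT targeting spatio-temporal localization) can be pictured with the minimal sketch below. All identifiers here (EgoQASample, sft_stage, rft_stage, the model methods, and the reward function) are hypothetical placeholders for illustration, not the released EgoThinker API; see the repository above for the actual implementation.

```python
# Minimal sketch of the two-stage curriculum described in the abstract:
# Stage 1 (SFT) imitates CoT rationales + answers; Stage 2 (RFT) optimizes a
# reward that scores spatio-temporal (hand-object) localization.
# All identifiers are hypothetical placeholders, not the EgoThinker codebase.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EgoQASample:
    clip_id: str                          # egocentric video clip
    question: str
    cot_rationale: str                    # chain-of-thought annotation
    answer: str
    hand_object_boxes: List[Tuple[float, float, float, float]]  # grounding


def sft_stage(model, data: List[EgoQASample], step: Callable) -> None:
    """Stage 1: supervised fine-tuning on CoT rationales and answers."""
    for sample in data:
        target = f"{sample.cot_rationale}\n{sample.answer}"
        # `step` computes the token-level loss and applies an optimizer update.
        step(model, sample.clip_id, sample.question, target)


def rft_stage(model, data: List[EgoQASample], reward_fn: Callable) -> None:
    """Stage 2: reinforcement fine-tuning with a localization-aware reward."""
    for sample in data:
        rollout = model.generate(sample.clip_id, sample.question)
        reward = reward_fn(rollout, sample.hand_object_boxes)
        # Placeholder for a policy-gradient style update.
        model.reinforce(rollout, reward)
```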
Authors (8)
Baoqi Pei
Yifei Huang
Jilan Xu
Yuping He
Guo Chen
Fei Wu
+2 more
Submitted
October 27, 2025
Key Contributions
Introduces EgoThinker, a framework enabling MLLMs to perform egocentric video reasoning by leveraging spatio-temporal CoT supervision and a two-stage learning curriculum (SFT+RFT). It also introduces EgoRe-5M, a large-scale egocentric QA dataset with detailed rationales and grounding, bridging the gap in embodied understanding.
Business Value
Enhances the ability of AI systems, particularly robots and virtual agents, to understand and interact with the world from a first-person perspective, crucial for applications in robotics, VR/AR, and human-AI collaboration.