
Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video

Abstract

Envision an AI capable of functioning in human-like settings, moving beyond mere observation to actively understand, anticipate, and proactively respond to unfolding events. Towards this vision, we focus on the innovative task where, given ego-streaming video input, an assistant proactively answers diverse, evolving questions at the opportune moment, while maintaining synchronized perception and reasoning. This task embodies three key properties: (1) Proactive Coherence, (2) Just-in-Time Responsiveness, and (3) Synchronized Efficiency. To evaluate and address these properties, we first introduce ESTP-Bench (Ego Streaming Proactive Benchmark) alongside the ESTP-F1 metric, a novel framework designed for their rigorous assessment. Second, we propose a comprehensive technical pipeline to enable models to tackle this challenging task. This pipeline comprises: (1) a data engine, (2) a multi-stage training strategy, and (3) a proactive dynamic compression technique. Our proposed model effectively addresses these critical properties while outperforming multiple baselines across diverse online and offline benchmarks. Project Page: https://zhangyl4.github.io/publications/eyes-wide-open/
Authors: Yulin Zhang, Cheng Shi, Yang Wang, Sibei Yang
Submitted: October 16, 2025
arXiv Category: cs.CV

Key Contributions

This paper introduces the task of proactive question answering over ego-streaming video, in which an assistant answers diverse, evolving questions at the opportune moment, together with ESTP-Bench and the ESTP-F1 metric for evaluating it. It also proposes a pipeline comprising a data engine, a multi-stage training strategy, and a proactive dynamic compression technique, so that models can answer proactively while keeping perception and reasoning synchronized.

Business Value

Enables AI systems to provide real-time, anticipatory insights from video feeds, improving situational awareness in security, autonomous systems, and interactive experiences.