📄 Abstract
Envision an AI capable of functioning in human-like settings, moving beyond
mere observation to actively understand, anticipate, and proactively respond to
unfolding events. Towards this vision, we focus on the innovative task where,
given ego-streaming video input, an assistant proactively answers diverse,
evolving questions at the opportune moment, while maintaining synchronized
perception and reasoning. This task embodies three key properties: (1)
Proactive Coherence, (2) Just-in-Time Responsiveness, and (3) Synchronized
Efficiency. To evaluate and address these properties, we first introduce
ESTP-Bench (Ego Streaming Proactive Benchmark) alongside the ESTP-F1 metric, a
novel framework designed for their rigorous assessment. Second, we propose a
comprehensive technical pipeline to enable models to tackle this challenging
task. This pipeline comprises: (1) a data engine, (2) a multi-stage training
strategy, and (3) a proactive dynamic compression technique. Our proposed model
effectively addresses these critical properties while outperforming multiple
baselines across diverse online and offline benchmarks.
Project Page: https://zhangyl4.github.io/publications/eyes-wide-open/
Authors (4)
Yulin Zhang
Cheng Shi
Yang Wang
Sibei Yang
Submitted
October 16, 2025
Key Contributions
This paper introduces a proactive question-answering task for video-LLMs on streaming egocentric video, along with the ESTP-Bench benchmark and the ESTP-F1 metric for evaluation. It proposes a pipeline comprising a data engine, a multi-stage training strategy, and a proactive dynamic compression technique, so that models can answer evolving questions at the opportune moment while keeping perception and reasoning synchronized.
Business Value
Enables AI systems to provide real-time, anticipatory insights from video feeds, improving situational awareness in security, autonomous systems, and interactive experiences.