Abstract
Humans naturally perform temporal screening by dragging the progress bar and
focusing on salient temporal segments, but current Video Large Language Models
(Video-LLMs) struggle to capture fine-grained temporal semantics due to sparse
frame sampling and insufficient inter-frame reasoning supervision during their
training. To address this, inspired by well-established cognitive science
principles, we propose Temporal Visual Screening (TVS), a new task that
universally pre-processes video question answering and instruction tuning data
by: (1) retaining focus-critical video segments, (2) synchronously
reconstructing queries to their most direct form while preserving answer
consistency, and (3) ensuring invariance and consistency of any possible
answer. TVS is formulated as a modular front-end adapter task that can be
seamlessly integrated into both Video Instruction Tuning (training) and Video
Question Answering (inference) pipelines. TVS optimizes the distribution of
reasoning burden and cognitive load; during training, it aligns queries with
focus-critical visual information; at inference, it enables query-aware segment
focus and streamlined query representations. In particular, we curate the first
benchmark for TVS and propose ReSimplifyIt, a baseline outperforming prior
approaches on seemingly similar tasks by 0.47 in F1 score on video trimming
while achieving competitive query rewriting performance. Experiments
show that incorporating TVS yields relative gains of 7.33% (training)
and 34.6% (inference), demonstrating the effectiveness of temporal information
screening for improving video-language understanding.
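
To make the pipeline integration concrete, the following is a minimal sketch of how a TVS front-end adapter could sit ahead of a Video-LLM at inference time. All interface names here (`TVSResult`, `screen`, `answer_with_tvs`, and the `video_llm.answer` method) are illustrative assumptions, not an API released with the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TVSResult:
    """Output of Temporal Visual Screening (illustrative structure)."""
    segments: List[Tuple[float, float]]  # retained focus-critical spans (start, end) in seconds
    query: str                           # query reconstructed to its most direct form


def screen(video_path: str, query: str) -> TVSResult:
    """Stand-in for a TVS model such as the ReSimplifyIt baseline:
    trims the video to focus-critical segments and simplifies the query
    while keeping any possible answer unchanged."""
    raise NotImplementedError("replace with an actual TVS adapter")


def answer_with_tvs(video_llm, video_path: str, query: str) -> str:
    """Query-aware inference: screen first, then prompt the Video-LLM with
    only the retained segments and the streamlined query.
    `video_llm` is assumed to expose an `answer(path, segments, prompt)` method."""
    result = screen(video_path, query)
    return video_llm.answer(video_path, result.segments, result.query)
```

The same adapter can be applied offline to instruction-tuning data, so that training queries are aligned with the focus-critical segments before the Video-LLM ever sees them.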