Abstract
Vision-language models (VLMs) have recently expanded from static image
understanding to video reasoning, but their scalability is fundamentally
limited by the quadratic cost of processing dense frame sequences. Long videos
often exceed the token budget of modern language models, leading to severe
context limitations and latency issues. We introduce Efficient Video Sampling
(EVS), a simple, plug-and-play method for reducing token redundancy in videos
by identifying and pruning temporally static patches -- spatial regions that
remain unchanged across consecutive frames. EVS preserves positional identity
and requires no architectural changes or retraining. We show that EVS substantially
reduces token count while maintaining semantic fidelity, enabling faster
inference and longer input sequences. Applied at inference time, EVS reduces
large language model (LLM) time-to-first-token (TTFT) by up to 4x with minimal
accuracy loss. When combined with an uptraining phase using stochastic pruning
rates, EVS yields models that are robust to varying compression levels and
retain full performance under aggressive pruning. Extensive experiments
demonstrate that EVS consistently improves efficiency-accuracy trade-offs,
unlocking scalable video-language understanding without sacrificing quality.
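To make the pruning idea concrete, here is a minimal sketch of temporally static patch pruning, written as an illustration rather than the authors' implementation. It assumes patch changes are scored by mean absolute pixel difference between consecutive frames, and that the least-changed patches in each frame are dropped while their (frame, patch) positions are retained so positional identity survives. All function and parameter names (`evs_keep_mask`, `prune_rate`, the patch size) are hypothetical.

```python
import torch
import torch.nn.functional as F

def evs_keep_mask(frames: torch.Tensor, patch: int = 14,
                  prune_rate: float = 0.75) -> torch.Tensor:
    """Boolean keep-mask over (frame, patch) positions.

    frames: (T, C, H, W) video tensor.
    prune_rate: fraction of patches dropped in every frame after the first.
    Illustrative sketch only; not the paper's API.
    """
    T, C, H, W = frames.shape
    # Average pixels within each non-overlapping patch -> (T, C, H//p, W//p).
    pooled = F.avg_pool2d(frames, kernel_size=patch, stride=patch)
    # Per-patch change magnitude between consecutive frames -> (T-1, n_patches).
    diff = (pooled[1:] - pooled[:-1]).abs().mean(dim=1).flatten(1)
    n_patches = diff.shape[1]
    # Keep the first frame in full; prune static patches in later frames.
    keep = torch.ones(T, n_patches, dtype=torch.bool)
    k = max(1, int(round((1.0 - prune_rate) * n_patches)))
    topk = diff.topk(k, dim=1).indices          # most-changed patches per frame
    keep[1:] = False
    keep[1:].scatter_(1, topk, True)
    # Indexing flattened (frame, patch) tokens with this mask preserves
    # each surviving token's original positional id.
    return keep

video = torch.rand(16, 3, 224, 224)             # 16 synthetic frames
mask = evs_keep_mask(video, patch=14, prune_rate=0.75)
print(mask.sum().item(), "of", mask.numel(), "patch tokens kept")
```

Because the mask is computed from pixels alone, a sketch like this can sit in front of any frozen VLM at inference time, which is what makes the method plug-and-play.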
Authors (12)
Natan Bagrov
Eugene Khvedchenia
Borys Tymchenko
Shay Aharon
Lior Kadoch
Tomer Keren
and 6 additional authors
Submitted
October 16, 2025
Key Contributions
Efficient Video Sampling (EVS) is a plug-and-play method that reduces token redundancy in videos by pruning temporally static patches. This significantly reduces token count and LLM latency (up to 4x TTFT reduction) with minimal accuracy loss, enabling faster inference and longer video inputs.
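The abstract also describes an uptraining phase with stochastic pruning rates, so the model sees many compression levels during training. A minimal sketch of that idea, reusing the hypothetical `evs_keep_mask` helper above; the model interface shown is an assumption, not the paper's API.

```python
import random

def uptrain_step(model, video, labels, loss_fn):
    """One uptraining step with a randomly sampled pruning rate.

    `model` is assumed to accept a keep_mask selecting which patch tokens
    to process; the 0.0-0.9 range is illustrative, not from the paper.
    """
    rate = random.uniform(0.0, 0.9)               # sample a pruning rate
    mask = evs_keep_mask(video, prune_rate=rate)  # from the sketch above
    logits = model(video, keep_mask=mask)
    loss = loss_fn(logits, labels)
    loss.backward()
    return loss
```

Exposing the model to varying rates in this way is what the abstract credits for robustness: the same weights then retain full performance even under aggressive pruning at inference time.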
Business Value
Dramatically speeds up video analysis tasks powered by VLMs, making real-time applications like video moderation, surveillance analysis, and interactive video search more feasible and cost-effective.