
Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference

Abstract

Vision-language models (VLMs) have recently expanded from static image understanding to video reasoning, but their scalability is fundamentally limited by the quadratic cost of processing dense frame sequences. Long videos often exceed the token budget of modern language models, leading to severe context limitations and latency issues. We introduce Efficient Video Sampling (EVS), a simple, plug-and-play method for reducing token redundancy in videos by identifying and pruning temporally static patches -- spatial regions that remain unchanged across consecutive frames. EVS preserves positional identity and requires no architectural changes or retraining. We show that EVS substantially reduces token count while maintaining semantic fidelity, enabling faster inference and longer input sequences. Applied at inference time, EVS reduces large language model (LLM) time-to-first-token (TTFT) by up to 4x with minimal accuracy loss. When combined with an uptraining phase using stochastic pruning rates, EVS yields models that are robust to varying compression levels and retain full performance under aggressive pruning. Extensive experiments demonstrate that EVS consistently improves efficiency-accuracy trade-offs, unlocking scalable video-language understanding without sacrificing quality.
Authors (12)
Natan Bagrov
Eugene Khvedchenia
Borys Tymchenko
Shay Aharon
Lior Kadoch
Tomer Keren
+6 more
Submitted
October 16, 2025
arXiv Category
cs.CV
arXiv PDF

Key Contributions

Efficient Video Sampling (EVS) is a plug-and-play method that reduces token redundancy in videos by pruning temporally static patches. This significantly reduces token count and LLM latency (up to 4x TTFT reduction) with minimal accuracy loss, enabling faster inference and longer video inputs.
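The core idea above -- dropping patches that stay unchanged between consecutive frames while keeping each surviving patch's positional identity -- can be sketched as follows. This is a minimal illustration only, not the paper's implementation; the function name `evs_prune`, the patch size, and the mean-absolute-difference threshold are all hypothetical choices for clarity.

```python
import numpy as np

def evs_prune(frames, patch_size=16, threshold=0.02):
    """Hypothetical sketch of EVS-style temporal pruning.

    frames: (T, H, W, C) float array in [0, 1].
    Returns a list of ((frame, row, col), patch) pairs, so every kept
    patch retains its positional identity for downstream embedding.
    """
    T, H, W, C = frames.shape
    ph, pw = H // patch_size, W // patch_size
    # Split each frame into non-overlapping patches:
    # (T, ph, pw, patch_size, patch_size, C)
    patches = frames.reshape(T, ph, patch_size, pw, patch_size, C)
    patches = patches.transpose(0, 1, 3, 2, 4, 5)

    kept = []
    for t in range(T):
        for i in range(ph):
            for j in range(pw):
                if t == 0:
                    static = False  # always keep the first frame in full
                else:
                    # Mean absolute difference vs. the same spatial
                    # patch in the previous frame.
                    diff = np.abs(patches[t, i, j] - patches[t - 1, i, j]).mean()
                    static = diff < threshold
                if not static:
                    kept.append(((t, i, j), patches[t, i, j]))
    return kept
```

For a perfectly static clip this keeps only the first frame's patches; any patch that changes in a later frame survives along with its (frame, row, col) index, which is what lets positional encodings be applied unchanged.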

Business Value

Dramatically speeds up video analysis tasks powered by VLMs, making real-time applications like video moderation, surveillance analysis, and interactive video search more feasible and cost-effective.