📄 Abstract
Modern multimodal large language models (MLLMs) can reason over hour-long
video, yet their key-value (KV) cache grows linearly with time, quickly
exceeding the fixed memory of phones, AR glasses, and edge robots. Prior
compression schemes either assume the whole video and user query are available
offline or must first build the full cache, so memory still scales with stream
length. InfiniPot-V is the first training-free, query-agnostic framework that
enforces a hard, length-independent memory cap for streaming video
understanding. During video encoding it monitors the cache and, once a user-set
threshold is reached, runs a lightweight compression pass that (i) removes
temporally redundant tokens via a Temporal-axis Redundancy (TaR) metric and (ii)
keeps semantically significant tokens via Value-Norm (VaN) ranking. Across four
open-source MLLMs and four long-video and streaming-video benchmarks,
InfiniPot-V cuts peak GPU memory by up to 94%, sustains real-time generation,
and matches or surpasses full-cache accuracy, even in multi-turn dialogues. By
dissolving the KV cache bottleneck without retraining or query knowledge,
InfiniPot-V closes the gap for on-device streaming video assistants.
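The abstract only names the two compression stages, so the sketch below is a rough, assumption-laden illustration of how a threshold-triggered TaR + VaN pass could look in PyTorch. It treats TaR as cosine similarity between a token's key and the key at the same spatial position in the previous frame, and VaN as the L2 norm of each value vector; the function name `compress_kv`, the per-head `[num_tokens, dim]` layout, `tokens_per_frame`, and `keep_ratio_tar` are hypothetical choices for illustration, not the paper's definitions.

```python
# Minimal sketch of threshold-triggered KV-cache compression in the spirit of
# InfiniPot-V's TaR + VaN stages. Exact metrics, tensor layouts, and the
# frame-aligned token grid assumed here are illustrative guesses.
import torch


def compress_kv(keys, values, mem_cap, tokens_per_frame, keep_ratio_tar=0.5):
    """keys, values: [num_tokens, dim] for one attention head (assumed layout).

    Tokens are assumed to arrive frame by frame, `tokens_per_frame` per frame,
    with matching spatial positions across frames.
    """
    n = keys.shape[0]
    if n <= mem_cap:  # cache still under the user-set cap: nothing to do
        return keys, values

    # Stage 1: Temporal-axis Redundancy (TaR), assumed here as cosine
    # similarity between a token's key and the key at the same spatial slot
    # one frame earlier; highly similar tokens are treated as redundant.
    frames = keys.view(-1, tokens_per_frame, keys.shape[-1])
    prev = torch.nn.functional.normalize(frames[:-1], dim=-1)
    curr = torch.nn.functional.normalize(frames[1:], dim=-1)
    sim = (prev * curr).sum(-1)  # [num_frames - 1, tokens_per_frame]
    # The first frame has no predecessor, so it is never marked redundant.
    redundancy = torch.cat(
        [torch.zeros(1, tokens_per_frame, device=sim.device, dtype=sim.dtype), sim],
        dim=0,
    ).flatten()

    # Keep the least redundant tokens, but never fewer than the memory cap.
    n_keep_tar = max(int(n * keep_ratio_tar), mem_cap)
    tar_idx = torch.topk(-redundancy, n_keep_tar).indices.sort().values
    keys, values = keys[tar_idx], values[tar_idx]

    # Stage 2: Value-Norm (VaN) ranking, assumed here as the L2 norm of each
    # value vector; keep the highest-norm tokens up to the memory cap.
    van_score = values.norm(dim=-1)
    van_idx = torch.topk(van_score, mem_cap).indices.sort().values
    return keys[van_idx], values[van_idx]
```

Because the pass fires whenever the cache reaches the user-set cap, peak KV memory stays bounded by that cap no matter how long the stream runs, which is the length-independence property the abstract emphasizes.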
Authors (4)
Minsoo Kim
Kyuhong Shim
Jungwook Choi
Simyung Chang
Key Contributions
Introduces InfiniPot-V, a training-free, query-agnostic framework for memory-constrained KV cache compression in streaming video understanding. It enforces a length-independent memory cap by removing temporally redundant tokens and keeping semantically significant ones, significantly reducing peak GPU memory while sustaining real-time generation.
Business Value
Enables powerful multimodal LLMs to operate on resource-constrained edge devices for real-time video understanding, opening up applications in surveillance, autonomous systems, and interactive AR/VR experiences.