📄 Abstract
Modern multimodal large language models (MLLMs) can reason over hour-long
video, yet their key-value (KV) cache grows linearly with time, quickly
exceeding the fixed memory of phones, AR glasses, and edge robots. Prior
compression schemes either assume the whole video and user query are available
offline or must first build the full cache, so memory still scales with stream
length. InfiniPot-V is the first training-free, query-agnostic framework that
enforces a hard, length-independent memory cap for streaming video
understanding. During video encoding it monitors the cache and, once a user-set
threshold is reached, runs a lightweight compression pass that (i) removes
temporally redundant tokens via a Temporal-axis Redundancy (TaR) metric and (ii)
keeps semantically significant tokens via Value-Norm (VaN) ranking. Across four
open-source MLLMs and four long-video and streaming-video benchmarks,
InfiniPot-V cuts peak GPU memory by up to 94%, sustains real-time generation,
and matches or surpasses full-cache accuracy, even in multi-turn dialogues. By
dissolving the KV cache bottleneck without retraining or query knowledge,
InfiniPot-V closes the gap for on-device streaming video assistants.
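The abstract only names the two compression stages, so the sketch below is a rough, assumption-laden illustration of how a threshold-triggered TaR + VaN pass could look in PyTorch. It treats TaR as cosine similarity between a token's key and the key at the same spatial position in the previous frame, and VaN as the L2 norm of each value vector; the function name `compress_kv`, the per-head `[num_tokens, dim]` layout, `tokens_per_frame`, and `keep_ratio_tar` are hypothetical choices for illustration, not the paper's definitions.

```python
# Minimal sketch of threshold-triggered KV-cache compression in the spirit of
# InfiniPot-V's TaR + VaN stages. Exact metrics, tensor layouts, and the
# frame-aligned token grid assumed here are illustrative guesses.
import torch


def compress_kv(keys, values, mem_cap, tokens_per_frame, keep_ratio_tar=0.5):
    """keys, values: [num_tokens, dim] for one attention head (assumed layout).

    Tokens are assumed to arrive frame by frame, `tokens_per_frame` per frame,
    with matching spatial positions across frames.
    """
    n = keys.shape[0]
    if n <= mem_cap:  # cache still under the user-set cap: nothing to do
        return keys, values

    # Stage 1: Temporal-axis Redundancy (TaR), assumed here as cosine
    # similarity between a token's key and the key at the same spatial slot
    # one frame earlier; highly similar tokens are treated as redundant.
    frames = keys.view(-1, tokens_per_frame, keys.shape[-1])
    prev = torch.nn.functional.normalize(frames[:-1], dim=-1)
    curr = torch.nn.functional.normalize(frames[1:], dim=-1)
    sim = (prev * curr).sum(-1)  # [num_frames - 1, tokens_per_frame]
    # The first frame has no predecessor, so it is never marked redundant.
    redundancy = torch.cat(
        [torch.zeros(1, tokens_per_frame, device=sim.device, dtype=sim.dtype), sim],
        dim=0,
    ).flatten()

    # Keep the least redundant tokens, but never fewer than the memory cap.
    n_keep_tar = max(int(n * keep_ratio_tar), mem_cap)
    tar_idx = torch.topk(-redundancy, n_keep_tar).indices.sort().values
    keys, values = keys[tar_idx], values[tar_idx]

    # Stage 2: Value-Norm (VaN) ranking, assumed here as the L2 norm of each
    # value vector; keep the highest-norm tokens up to the memory cap.
    van_score = values.norm(dim=-1)
    van_idx = torch.topk(van_score, mem_cap).indices.sort().values
    return keys[van_idx], values[van_idx]
```

Because the pass fires whenever the cache reaches the user-set cap, peak KV memory stays bounded by that cap no matter how long the stream runs, which is the length-independence property the abstract emphasizes.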
Authors (4)
Minsoo Kim
Kyuhong Shim
Jungwook Choi
Simyung Chang
Key Contributions
Introduces InfiniPot-V, a training-free, query-agnostic framework for memory-constrained KV cache compression in streaming video understanding. It enforces a length-independent memory cap by removing temporally redundant tokens and keeping semantically significant ones, significantly reducing peak GPU memory while sustaining real-time generation.
Business Value
Enables powerful multimodal LLMs to operate on resource-constrained edge devices for real-time video understanding, opening up applications in surveillance, autonomous systems, and interactive AR/VR experiences.