Abstract
With the development of large language models (LLMs), efficient inference
through Key-Value (KV) cache compression has attracted considerable attention,
especially for long-context generation. To compress the KV cache, recent
methods identify critical KV tokens through static modeling of attention
scores. However, these methods often struggle to accurately determine critical
tokens as they neglect the temporal patterns in attention scores, resulting in
a noticeable degradation in LLM performance. To address this challenge, we
propose AttentionPredictor, which is the first learning-based method to
directly predict attention patterns for KV cache compression and critical token
identification. Specifically, AttentionPredictor learns a lightweight, unified
convolution model to dynamically capture spatiotemporal patterns and predict
the next-token attention scores. An appealing feature of AttentionPredictor is
that it accurately predicts attention scores while sharing a single prediction
model, which consumes negligible memory, across all transformer layers.
Moreover, we propose a cross-token critical cache prefetching framework
that hides the token estimation time overhead to accelerate the decoding stage.
By retaining most of the attention information, AttentionPredictor achieves
13× KV cache compression and 5.6× speedup in a cache offloading
scenario with comparable LLM performance, significantly outperforming the
state-of-the-art methods. The code is available at
https://github.com/MIRALab-USTC/LLM-AttentionPredictor.
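The core idea in the abstract — predict the next decoding step's attention scores from a temporal window of past scores, then keep only the top-scoring KV tokens — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the actual method learns a convolutional model, whereas here a fixed recency-weighted temporal kernel stands in for the learned weights, and all names, shapes, and the `keep_ratio` parameter are illustrative assumptions.

```python
import numpy as np

def predict_next_attention(history, kernel):
    """Predict next-step attention scores from past per-token scores.

    history : (T, N) array -- attention scores over N cached tokens at
              the last T decoding steps (each row sums to 1).
    kernel  : (T,) array   -- temporal weights; fixed here for
              illustration, learned by a conv model in the paper.
    Returns a (N,) array of predicted scores, renormalized.
    """
    pred = kernel @ history            # weighted sum over the time axis
    pred = np.clip(pred, 0.0, None)    # scores must stay non-negative
    return pred / pred.sum()

def select_critical_tokens(pred_scores, keep_ratio=0.25):
    """Keep the top-k KV cache tokens by predicted attention score."""
    k = max(1, int(len(pred_scores) * keep_ratio))
    return np.sort(np.argsort(pred_scores)[-k:])  # sorted token indices

# Toy example: window of 4 past steps over 8 cached tokens.
rng = np.random.default_rng(0)
history = rng.random((4, 8))
history /= history.sum(axis=1, keepdims=True)
kernel = np.array([0.1, 0.2, 0.3, 0.4])   # assumed recency weighting
pred = predict_next_attention(history, kernel)
critical = select_critical_tokens(pred, keep_ratio=0.25)
```

With `keep_ratio=0.25` and 8 cached tokens, only 2 token indices are retained, mirroring the compression idea; the paper's cross-token prefetching framework would additionally overlap this selection with decoding to hide its latency.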
Authors (11)
Qingyue Yang
Jie Wang
Xing Li
Zhihai Wang
Chen Chen
Lei Chen
Submitted
February 6, 2025
Key Contributions
AttentionPredictor is the first learning-based method to directly predict attention patterns for KV cache compression. It uses a lightweight, unified convolutional model to dynamically capture spatiotemporal patterns and predict next-token attention scores, addressing the limitations of static modeling methods that neglect temporal dynamics.
Business Value
Enables faster and more cost-effective deployment of LLMs for applications requiring long-context generation, such as advanced chatbots, summarization tools, and content creation platforms.