📄 Abstract
To enhance the efficiency of the attention mechanism within large language
models (LLMs), previous works primarily compress the KV cache or group
attention heads, while largely overlooking redundancy between layers. Our
comprehensive analyses across various LLMs show that highly similar attention
patterns persist within most layers. It is intuitive to reduce this redundancy by
sharing attention weights across layers. However, further analysis reveals two
challenges: (1) Directly sharing the weight matrix without carefully
rearranging the attention heads proves to be ineffective; (2) Shallow layers
are vulnerable to small deviations in attention weights.
Driven by these insights, we introduce LISA, a lightweight substitute for
self-attention in well-trained LLMs. LISA employs tiny feed-forward networks to
align attention heads between adjacent layers and low-rank matrices to
approximate differences in layer-wise attention weights. Evaluations
encompassing 13 typical benchmarks demonstrate that LISA maintains high
response quality in terms of accuracy and perplexity while reducing redundant
attention calculations within 53%-84% of the total layers. Our implementations
of LISA achieve a 6x compression of Q and K matrices within the attention
mechanism, with maximum throughput improvements of 19.5%, 32.3%, and 40.1% for
LLaMA3-8B, LLaMA2-7B, and LLaMA2-13B, respectively.
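To make the shared-attention idea concrete, below is a minimal PyTorch-style sketch: a layer reuses the previous layer's pre-softmax attention scores, realigned across heads by a tiny feed-forward network, plus a low-rank Q/K correction that approximates the layer-wise difference. The module names (TinyHeadAligner, LowRankDelta, SharedAttention), shapes, and hyperparameters are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch of cross-layer attention sharing (assumed design, not the
# authors' released code): reuse the previous layer's attention scores, align
# heads with a tiny FFN, and add a low-rank Q/K correction.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyHeadAligner(nn.Module):
    """Tiny feed-forward net that re-mixes attention heads between adjacent layers."""
    def __init__(self, n_heads: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_heads, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_heads),
        )

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (batch, heads, q_len, k_len); mix along the head dimension so
        # this layer's heads are expressed as combinations of the previous layer's.
        mixed = self.net(scores.permute(0, 2, 3, 1))  # (B, q, k, H)
        return mixed.permute(0, 3, 1, 2)              # back to (B, H, q, k)


class LowRankDelta(nn.Module):
    """Low-rank Q/K projections approximating the residual attention difference."""
    def __init__(self, d_model: int, n_heads: int, rank: int = 16):
        super().__init__()
        self.n_heads, self.rank = n_heads, rank
        self.q_proj = nn.Linear(d_model, n_heads * rank, bias=False)
        self.k_proj = nn.Linear(d_model, n_heads * rank, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> low-rank score correction (B, H, seq, seq)
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.rank).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_heads, self.rank).transpose(1, 2)
        return (q @ k.transpose(-2, -1)) / self.rank ** 0.5


class SharedAttention(nn.Module):
    """Replaces a layer's full Q/K computation with shared scores + low-rank delta."""
    def __init__(self, d_model: int, n_heads: int, rank: int = 16):
        super().__init__()
        self.aligner = TinyHeadAligner(n_heads)
        self.delta = LowRankDelta(d_model, n_heads, rank)

    def forward(self, x, prev_scores, attn_mask=None):
        scores = self.aligner(prev_scores) + self.delta(x)
        if attn_mask is not None:
            scores = scores + attn_mask
        # The resulting weights are applied to this layer's own V projection.
        return F.softmax(scores, dim=-1)
```

In this sketch, `prev_scores` would be the pre-softmax attention scores cached from the preceding layer, so a layer equipped with SharedAttention skips its own full-rank Q and K projections; only the small aligner and low-rank matrices are added, which is consistent with the reported compression of Q and K.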
Authors (12)
Yongyu Mu
Yuzhang Wu
Yuchun Fan
Chenglong Wang
Hengyu Li
Jiali Zeng
(and 6 more authors)
Key Contributions
Introduces LISA, a lightweight substitute for self-attention in LLMs that shares attention weights across layers using feed-forward networks and low-rank matrices. This approach addresses redundancy between layers, improving efficiency without significant performance degradation.
Business Value
Enables the deployment of larger and more capable LLMs by reducing computational costs and memory footprint, making advanced AI more accessible.