
Cross-layer Attention Sharing for Pre-trained Large Language Models

📄 Abstract

To enhance the efficiency of the attention mechanism within large language models (LLMs), previous works primarily compress the KV cache or group attention heads, while largely overlooking redundancy between layers. Our comprehensive analyses across various LLMs show that highly similar attention patterns persist within most layers. It is intuitive to reduce this redundancy by sharing attention weights across layers. However, further analysis reveals two challenges: (1) directly sharing the weight matrix without carefully rearranging the attention heads proves ineffective; (2) shallow layers are vulnerable to small deviations in attention weights. Driven by these insights, we introduce LISA, a lightweight substitute for self-attention in well-trained LLMs. LISA employs tiny feed-forward networks to align attention heads between adjacent layers and low-rank matrices to approximate differences in layer-wise attention weights. Evaluations across 13 typical benchmarks demonstrate that LISA maintains high response quality in terms of accuracy and perplexity while reducing redundant attention calculations within 53%-84% of the total layers. Our implementations of LISA achieve a 6x compression of the Q and K matrices within the attention mechanism, with maximum throughput improvements of 19.5%, 32.3%, and 40.1% for LLaMA3-8B, LLaMA2-7B, and LLaMA2-13B, respectively.
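
The mechanism described in the abstract can be pictured with a small sketch. The code below is an illustrative approximation only, not the authors' implementation: it assumes the shared quantity is the previous layer's pre-softmax attention scores, that head alignment is a tiny feed-forward map over the head dimension, and that the low-rank term corrects those scores from the current hidden states. The module name `LisaLikeAttention` and the `rank` / `align_hidden` parameters are hypothetical.

```python
# Illustrative sketch only -- not the paper's implementation. Assumes the
# reused quantity is the previous layer's pre-softmax attention scores;
# all names and hyperparameters here are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LisaLikeAttention(nn.Module):
    """Reuses attention scores from an adjacent layer instead of recomputing dense Q/K."""

    def __init__(self, d_model: int, n_heads: int, rank: int = 8, align_hidden: int = 32):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.rank = rank
        # Tiny FFN that mixes/aligns heads of the previous layer's scores.
        self.head_align = nn.Sequential(
            nn.Linear(n_heads, align_hidden),
            nn.ReLU(),
            nn.Linear(align_hidden, n_heads),
        )
        # Low-rank Q/K projections approximating the *difference* between this
        # layer's attention weights and the shared (previous-layer) ones.
        self.q_lowrank = nn.Linear(d_model, n_heads * rank, bias=False)
        self.k_lowrank = nn.Linear(d_model, n_heads * rank, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)  # V stays per-layer
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor, prev_scores: torch.Tensor) -> torch.Tensor:
        # x:           (batch, seq, d_model)       current hidden states
        # prev_scores: (batch, n_heads, seq, seq)  pre-softmax scores of the previous layer
        b, s, _ = x.shape
        # Align heads: treat the head axis as the feature axis of a tiny FFN.
        aligned = self.head_align(prev_scores.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        # Low-rank correction of the score matrix from the current layer's input.
        q_r = self.q_lowrank(x).view(b, s, self.n_heads, self.rank).transpose(1, 2)
        k_r = self.k_lowrank(x).view(b, s, self.n_heads, self.rank).transpose(1, 2)
        delta = q_r @ k_r.transpose(-1, -2) / self.rank ** 0.5
        attn = F.softmax(aligned + delta, dim=-1)
        v = self.v_proj(x).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        out = (attn @ v).transpose(1, 2).reshape(b, s, -1)
        return self.o_proj(out)
```

In this sketch only the expensive dense Q/K score computation is replaced; the value and output projections remain per-layer, which matches the abstract's focus on compressing Q and K.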
Authors (12)
Yongyu Mu
Yuzhang Wu
Yuchun Fan
Chenglong Wang
Hengyu Li
Jiali Zeng
+6 more
Submitted
August 4, 2024
arXiv Category
cs.CL
arXiv PDF

Key Contributions

Introduces LISA, a lightweight substitute for self-attention in well-trained LLMs that shares attention weights across layers, using tiny feed-forward networks to align attention heads between adjacent layers and low-rank matrices to approximate the remaining differences. This targets cross-layer redundancy directly, improving efficiency without significant degradation in accuracy or perplexity.
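
A back-of-the-envelope calculation helps place the reported 6x Q/K compression. The sketch below uses LLaMA2-7B-like dimensions (32 layers, d_model = 4096) and the upper end of the reported sharing range (84% of layers), but the correction rank and the exact accounting are assumptions, so the resulting factor is only indicative.

```python
# Rough illustration of how cross-layer sharing shrinks Q/K parameters.
# Dimensions match LLaMA2-7B; the sharing fraction comes from the paper's
# reported 53%-84% range, while the correction rank is a hypothetical choice.
d_model, n_layers, n_heads = 4096, 32, 32
rank = 64                       # assumed rank of the per-layer correction
shared_fraction = 0.84          # upper end of the reported sharing range

full_qk = n_layers * 2 * d_model * d_model            # dense Q and K in every layer
kept_dense = round(n_layers * (1 - shared_fraction))  # layers keeping dense Q/K
lowrank_qk = (n_layers - kept_dense) * 2 * d_model * rank
compressed = kept_dense * 2 * d_model * d_model + lowrank_qk

print(f"dense Q/K params:      {full_qk / 1e6:.0f}M")
print(f"compressed Q/K params: {compressed / 1e6:.0f}M")
print(f"compression factor:    {full_qk / compressed:.1f}x")   # ~5.9x under these assumptions
```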

Business Value

Enables the deployment of larger and more capable LLMs by reducing computational costs and memory footprint, making advanced AI more accessible.