📄 Abstract
Scaling language models to handle longer input sequences typically
necessitates large key-value (KV) caches, resulting in substantial memory
overhead during inference. In this paper, we propose Tensor Product Attention
(TPA), a novel attention mechanism that uses tensor decompositions to represent
queries, keys, and values compactly, substantially shrinking the KV cache size
at inference time. By factorizing these representations into contextual
low-rank components and seamlessly integrating with Rotary Position Embedding
(RoPE), TPA achieves improved model quality alongside memory efficiency. Based
on TPA, we introduce the Tensor ProducT ATTenTion Transformer (T6), a new model
architecture for sequence modeling. Through extensive empirical evaluation on
language modeling tasks, we demonstrate that T6 surpasses or matches the
performance of standard Transformer baselines including Multi-Head Attention
(MHA), Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and
Multi-Head Latent Attention (MLA) across various metrics, including perplexity
and a range of established evaluation benchmarks. Notably, TPA's memory and
computational efficiency at the decoding stage enable processing longer
sequences under fixed resource constraints, addressing a critical scalability
challenge in modern language models. Project Page:
https://github.com/tensorgi/TPA.
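The abstract describes factorizing queries, keys, and values into contextual low-rank components so that only small per-token factors need to be cached during decoding. The snippet below is a minimal sketch of that idea for keys or values; the module name, projection layout, and rank-scaling convention are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class TensorProductKV(nn.Module):
    """Hedged sketch of a TPA-style key/value factorization.

    Each token's per-head keys (or values) are reconstructed from rank-R
    contextual factors, so only the small factors A and B would be cached
    at decoding time instead of the full head-by-dimension tensor.
    """

    def __init__(self, d_model: int, n_heads: int, d_head: int, rank: int):
        super().__init__()
        self.n_heads, self.d_head, self.rank = n_heads, d_head, rank
        # Contextual factors produced from the hidden state:
        # a head-mixing factor A (rank x n_heads) and a feature factor B (rank x d_head).
        self.proj_a = nn.Linear(d_model, rank * n_heads)
        self.proj_b = nn.Linear(d_model, rank * d_head)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        b, s, _ = x.shape
        A = self.proj_a(x).view(b, s, self.rank, self.n_heads)  # cached factor
        B = self.proj_b(x).view(b, s, self.rank, self.d_head)   # cached factor
        # Reconstruct full per-head tensors as a sum of rank-1 outer products;
        # dividing by the rank is an assumed scaling choice for this sketch.
        kv = torch.einsum('bsrh,bsrd->bshd', A, B) / self.rank
        return kv  # (batch, seq, n_heads, d_head)

if __name__ == "__main__":
    layer = TensorProductKV(d_model=512, n_heads=8, d_head=64, rank=2)
    x = torch.randn(1, 16, 512)
    print(layer(x).shape)  # torch.Size([1, 16, 8, 64])
    # Cached per token: rank * (n_heads + d_head) = 2 * (8 + 64) = 144 values,
    # versus n_heads * d_head = 512 values for the full tensor (per K or V).
```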
Authors (7)
Yifan Zhang
Yifeng Liu
Huizhuo Yuan
Zhen Qin
Yang Yuan
Quanquan Gu
+1 more
Submitted
January 11, 2025
Key Contributions
Introduces Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to compactly represent queries, keys, and values. TPA significantly reduces the KV cache size during inference, leading to substantial memory savings while maintaining or improving model quality. The paper also introduces the T6 Transformer architecture based on TPA.
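For intuition on the claimed KV-cache reduction, a rough per-token comparison under assumed dimensions is shown below; the specific numbers and the factored-cache formula are illustrative, not taken from the paper.

```python
# Illustrative per-token KV cache comparison (assumed dimensions and cache layout).
n_heads, d_head, rank_k, rank_v = 32, 128, 2, 2

mha_cache = 2 * n_heads * d_head                    # full K and V: 8192 values/token
tpa_cache = (rank_k + rank_v) * (n_heads + d_head)  # factored K and V: 640 values/token

print(mha_cache, tpa_cache, mha_cache / tpa_cache)  # 8192 640 12.8
```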
Business Value
Enables more efficient deployment and scaling of large language models, reducing infrastructure costs and allowing for faster inference, which is critical for real-time applications and wider accessibility.