Abstract
Multi-head attention (MHA) has become the cornerstone of modern large
language models, enhancing representational capacity through parallel attention
heads. However, increasing the number of heads inherently weakens individual
head capacity, and existing attention mechanisms - whether standard MHA or its
variants like grouped-query attention (GQA) and grouped-tied attention (GTA) -
simply concatenate outputs from isolated heads without strong interaction. To
address this limitation, we propose knocking-heads attention (KHA), which
enables attention heads to "knock" on each other - facilitating cross-head
feature-level interactions before the scaled dot-product attention. This is
achieved by applying a shared, diagonally-initialized projection matrix across
all heads. The diagonal initialization preserves head-specific specialization
at the start of training while allowing the model to progressively learn
integrated cross-head representations. KHA adds only minimal parameters and
FLOPs and can be seamlessly integrated into MHA, GQA, GTA, and other attention
variants. We validate KHA by training a 6.1B parameter MoE model (1.01B
activated) on 1T high-quality tokens. Compared to baseline attention
mechanisms, KHA brings superior and more stable training dynamics, achieving
better performance across downstream tasks.
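To make the core idea concrete, here is a minimal PyTorch sketch (not from the paper) of what the shared, diagonally-initialized cross-head projection might look like. The class name KnockingProjection, the tensor layout, and the exact point at which the projection is applied are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class KnockingProjection(nn.Module):
    """Sketch of the cross-head 'knocking' step (an assumption, not the paper's code):
    a single projection shared by all heads over the concatenated head dimension,
    initialized to the identity (diagonal) so each head's features pass through
    unchanged at the start of training and cross-head mixing is learned gradually."""

    def __init__(self, num_heads: int, head_dim: int):
        super().__init__()
        d = num_heads * head_dim
        self.proj = nn.Linear(d, d, bias=False)
        nn.init.eye_(self.proj.weight)  # diagonal (identity) initialization

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, num_heads, head_dim) -> same shape, features mixed across heads
        b, t, h, dh = x.shape
        mixed = self.proj(x.reshape(b, t, h * dh))
        return mixed.reshape(b, t, h, dh)

# At initialization the shared matrix is the identity, so per-head
# specialization is preserved and training starts from the baseline behavior.
proj = KnockingProjection(num_heads=8, head_dim=64)
x = torch.randn(2, 16, 8, 64)
assert torch.allclose(proj(x), x)
```

Because the projection acts once on the concatenated head dimension, it adds a single d×d weight matrix rather than per-head parameters, which is consistent with the abstract's claim of minimal parameter and FLOP overhead.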
Authors (5)
Zhanchao Zhou
Xiaodong Chen
Haoxing Chen
Zhenzhong Lan
Jianguo Li
Submitted
October 27, 2025
Key Contributions
This paper introduces Knocking-Heads Attention (KHA), a novel attention mechanism that enables cross-head feature-level interactions before scaled dot-product attention. By applying a shared, diagonally-initialized projection matrix across all heads, KHA preserves head-specific specialization at the start of training while letting the model progressively learn integrated cross-head representations, and it adds only minimal parameters and FLOPs.
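To illustrate how such a projection could drop into a standard attention layer, the following self-contained sketch applies diagonally-initialized "knocking" projections to Q, K, and V before scaled dot-product attention. The layer names, the per-stream (Q/K/V) placement of the shared projection, and the causal setting are assumptions made for this example, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KHAAttention(nn.Module):
    """Illustrative sketch only (not the paper's formulation): multi-head attention
    with shared, diagonally-initialized 'knocking' projections that mix features
    across heads before scaled dot-product attention."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.dh = num_heads, d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        # One shared cross-head projection per stream (per-stream placement is an assumption).
        self.knock = nn.ModuleDict(
            {name: nn.Linear(d_model, d_model, bias=False) for name in ("q", "k", "v")}
        )
        for m in self.knock.values():
            nn.init.eye_(m.weight)  # identity start: heads are initially unmixed

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # "Knock": cross-head feature mixing over the concatenated head dimension.
        q, k, v = self.knock["q"](q), self.knock["k"](k), self.knock["v"](v)

        def split(z: torch.Tensor) -> torch.Tensor:
            return z.reshape(b, t, self.h, self.dh).transpose(1, 2)

        o = F.scaled_dot_product_attention(split(q), split(k), split(v), is_causal=True)
        return self.out(o.transpose(1, 2).reshape(b, t, self.h * self.dh))
```

The same pattern would apply to GQA or GTA variants, since the shared projection only touches the concatenated head features and leaves the grouping or tying of key/value heads unchanged.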
Business Value
More efficient and powerful attention mechanisms can lead to faster training and inference times for LLMs, reducing computational costs and enabling deployment on less powerful hardware. This can accelerate AI development and application.