
Knocking-Heads Attention

📄 Abstract

Multi-head attention (MHA) has become the cornerstone of modern large language models, enhancing representational capacity through parallel attention heads. However, increasing the number of heads inherently weakens individual head capacity, and existing attention mechanisms - whether standard MHA or its variants like grouped-query attention (GQA) and grouped-tied attention (GTA) - simply concatenate outputs from isolated heads without strong interaction. To address this limitation, we propose knocking-heads attention (KHA), which enables attention heads to "knock" on each other - facilitating cross-head feature-level interactions before the scaled dot-product attention. This is achieved by applying a shared, diagonally-initialized projection matrix across all heads. The diagonal initialization preserves head-specific specialization at the start of training while allowing the model to progressively learn integrated cross-head representations. KHA adds only minimal parameters and FLOPs and can be seamlessly integrated into MHA, GQA, GTA, and other attention variants. We validate KHA by training a 6.1B parameter MoE model (1.01B activated) on 1T high-quality tokens. Compared to baseline attention mechanisms, KHA brings superior and more stable training dynamics, achieving better performance across downstream tasks.
Authors (5)
Zhanchao Zhou
Xiaodong Chen
Haoxing Chen
Zhenzhong Lan
Jianguo Li
Submitted
October 27, 2025
arXiv Category
cs.CL

Key Contributions

This paper introduces Knocking-Heads Attention (KHA), an attention mechanism that enables cross-head feature-level interactions before the scaled dot-product attention. By applying a shared, diagonally-initialized projection matrix across all heads, KHA lets heads progressively learn integrated representations while preserving their initial specialization, and it adds only minimal parameters.
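
The abstract does not spell out the exact shape of the shared projection or which of the query, key, and value tensors it is applied to, so the sketch below is one possible reading rather than the paper's formulation: a learned head-mixing matrix, shared across feature dimensions and initialized to the identity (diagonal), lets each head draw on the other heads' features before standard scaled dot-product attention. The class name `KnockingHeadsAttention` and all internals are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KnockingHeadsAttention(nn.Module):
    """Illustrative sketch of the knocking-heads idea: heads exchange features
    through a shared, identity-initialized mixing matrix before attention."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.qkv_proj = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)
        # Shared cross-head ("knocking") projection. Diagonal (identity)
        # initialization means each head starts out using only its own
        # features; off-diagonal mixing weights are learned during training.
        self.knock = nn.Parameter(torch.eye(num_heads))

    def _split(self, u: torch.Tensor) -> torch.Tensor:
        b, t, _ = u.shape
        return u.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = map(self._split, self.qkv_proj(x).chunk(3, dim=-1))
        # Cross-head interaction before attention: each head's features become
        # a learned combination of all heads' features. Applying the same
        # mixing matrix to q, k, and v is an assumption of this sketch.
        q, k, v = (torch.einsum("gh,bhtd->bgtd", self.knock, u)
                   for u in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(out.transpose(1, 2).reshape(b, t, d))


# Example usage: a (batch, seq_len, d_model) input through the module.
attn = KnockingHeadsAttention(d_model=512, num_heads=8)
y = attn(torch.randn(2, 16, 512))
```

With identity initialization the module behaves exactly like standard multi-head attention at the start of training, which matches the paper's stated goal of preserving head-specific specialization before cross-head interactions are learned; the added parameter count in this reading is only num_heads x num_heads per layer.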

Business Value

Because KHA adds only minimal parameters and FLOPs, it offers better downstream performance and more stable training at essentially the same compute budget. For practitioners, this means stronger LLMs without redesigning the attention stack, since KHA can be dropped into MHA, GQA, GTA, and other attention variants.