Abstract
Multi-head attention (MHA) has become the cornerstone of modern large
language models, enhancing representational capacity through parallel attention
heads. However, increasing the number of heads inherently weakens individual
head capacity, and existing attention mechanisms - whether standard MHA or its
variants like grouped-query attention (GQA) and grouped-tied attention (GTA) -
simply concatenate outputs from isolated heads without strong interaction. To
address this limitation, we propose knocking-heads attention (KHA), which
enables attention heads to "knock" on each other - facilitating cross-head
feature-level interactions before the scaled dot-product attention. This is
achieved by applying a shared, diagonally-initialized projection matrix across
all heads. The diagonal initialization preserves head-specific specialization
at the start of training while allowing the model to progressively learn
integrated cross-head representations. KHA adds only minimal parameters and
FLOPs and can be seamlessly integrated into MHA, GQA, GTA, and other attention
variants. We validate KHA by training a 6.1B parameter MoE model (1.01B
activated) on 1T high-quality tokens. Compared to baseline attention
mechanisms, KHA brings superior and more stable training dynamics, achieving
better performance across downstream tasks.
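To make the core idea concrete, here is a minimal PyTorch sketch (not from the paper) of what the shared, diagonally-initialized cross-head projection might look like. The class name KnockingProjection, the tensor layout, and the exact point at which the projection is applied are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class KnockingProjection(nn.Module):
    """Sketch of the cross-head 'knocking' step (an assumption, not the paper's code):
    a single projection shared by all heads over the concatenated head dimension,
    initialized to the identity (diagonal) so each head's features pass through
    unchanged at the start of training and cross-head mixing is learned gradually."""

    def __init__(self, num_heads: int, head_dim: int):
        super().__init__()
        d = num_heads * head_dim
        self.proj = nn.Linear(d, d, bias=False)
        nn.init.eye_(self.proj.weight)  # diagonal (identity) initialization

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, num_heads, head_dim) -> same shape, features mixed across heads
        b, t, h, dh = x.shape
        mixed = self.proj(x.reshape(b, t, h * dh))
        return mixed.reshape(b, t, h, dh)

# At initialization the shared matrix is the identity, so per-head
# specialization is preserved and training starts from the baseline behavior.
proj = KnockingProjection(num_heads=8, head_dim=64)
x = torch.randn(2, 16, 8, 64)
assert torch.allclose(proj(x), x)
```

Because the projection acts once on the concatenated head dimension, it adds a single d×d weight matrix rather than per-head parameters, which is consistent with the abstract's claim of minimal parameter and FLOP overhead.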
Authors (5)
Zhanchao Zhou
Xiaodong Chen
Haoxing Chen
Zhenzhong Lan
Jianguo Li
Submitted
October 27, 2025
Key Contributions
This paper introduces Knocking-Heads Attention (KHA), a novel attention mechanism that enables cross-head feature-level interactions before scaled dot-product attention. By applying a shared, diagonally-initialized projection matrix across all heads, KHA preserves head-specific specialization at the start of training while letting the model progressively learn integrated cross-head representations, and it adds only minimal parameters and FLOPs.
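To illustrate how such a projection could drop into a standard attention layer, the following self-contained sketch applies diagonally-initialized "knocking" projections to Q, K, and V before scaled dot-product attention. The layer names, the per-stream (Q/K/V) placement of the shared projection, and the causal setting are assumptions made for this example, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KHAAttention(nn.Module):
    """Illustrative sketch only (not the paper's formulation): multi-head attention
    with shared, diagonally-initialized 'knocking' projections that mix features
    across heads before scaled dot-product attention."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.dh = num_heads, d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        # One shared cross-head projection per stream (per-stream placement is an assumption).
        self.knock = nn.ModuleDict(
            {name: nn.Linear(d_model, d_model, bias=False) for name in ("q", "k", "v")}
        )
        for m in self.knock.values():
            nn.init.eye_(m.weight)  # identity start: heads are initially unmixed

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # "Knock": cross-head feature mixing over the concatenated head dimension.
        q, k, v = self.knock["q"](q), self.knock["k"](k), self.knock["v"](v)

        def split(z: torch.Tensor) -> torch.Tensor:
            return z.reshape(b, t, self.h, self.dh).transpose(1, 2)

        o = F.scaled_dot_product_attention(split(q), split(k), split(v), is_causal=True)
        return self.out(o.transpose(1, 2).reshape(b, t, self.h * self.dh))
```

The same pattern would apply to GQA or GTA variants, since the shared projection only touches the concatenated head features and leaves the grouping or tying of key/value heads unchanged.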
Business Value
More efficient and powerful attention mechanisms can lead to faster training and inference times for LLMs, reducing computational costs and enabling deployment on less powerful hardware. This can accelerate AI development and application.