Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Decoder-Only Transformers

📄 Abstract

The Query, Key, Value weight triplet is a building block of current attention mechanisms in state-of-the-art LLMs. We theoretically investigate whether this triplet can be reduced, proving under simplifying assumptions that the Query weights are redundant, thereby reducing the number of non-embedding/lm-head parameters by over 8%. We validate the theory on full-complexity GPT-3 small architectures (with layer normalization, skip connections, and weight decay) trained from scratch, demonstrating that the reduced model achieves comparable validation loss to standard baselines. These findings motivate the investigation of the Query weight redundancy at scale.
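
A quick way to see why one of the two projections could be redundant (a sketch under simplified single-head assumptions, not the paper's actual proof): the pre-softmax attention logits depend on the Query and Key weights only through their product,

\[
\mathrm{logits}(X) \;=\; \frac{(X W_Q)(X W_K)^{\top}}{\sqrt{d_k}}
\;=\; \frac{X \,\bigl(W_Q W_K^{\top}\bigr)\, X^{\top}}{\sqrt{d_k}},
\]

so, dimensions permitting, one can fix \(W_Q = I\) and let a single retrained Key matrix play the role of the product. The over-8% figure is consistent with a rough per-block count: ignoring biases and layer norms, attention contributes about \(4d^2\) weights (\(W_Q, W_K, W_V, W_O\)) and the MLP about \(8d^2\), so dropping \(W_Q\) removes roughly \(1/12 \approx 8.3\%\) of the non-embedding/LM-head parameters.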
Authors: Marko Karbevski, Antonij Mijoski
Submitted: October 27, 2025
arXiv Category: cs.LG

Key Contributions

This paper proves, under simplifying assumptions, that the Query weights in the Query, Key, Value (QKV) triplet of decoder-only transformer attention are redundant. Empirical validation on GPT-3 small architectures trained from scratch shows that removing the Query weights achieves comparable validation loss while reducing non-embedding/LM-head parameters by over 8%.
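
Because the change is purely architectural, a minimal PyTorch sketch may help. It implements causal self-attention whose query path is the identity, so each head learns only Key and Value projections (plus an output projection). The module name, head layout, and the retained output projection are assumptions made for illustration, not the authors' exact construction.

import torch
import torch.nn as nn
import torch.nn.functional as F

class KVOnlySelfAttention(nn.Module):
    """Causal self-attention with no learned query projection (illustrative sketch)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Only Key/Value (and output) weights are learned; the query path is
        # the identity, which is where the parameter saving comes from.
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        # Queries are the hidden states themselves, split across heads (no W_Q).
        q = x.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.w_k(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_v(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product attention with a causal mask (PyTorch >= 2.0).
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).contiguous().view(b, t, d)
        return self.w_o(out)

# Quick shape check with GPT-3 small-like dimensions (d_model=768, 12 heads).
attn = KVOnlySelfAttention(d_model=768, n_heads=12)
x = torch.randn(2, 16, 768)
print(attn(x).shape)  # torch.Size([2, 16, 768])

Relative to a standard attention block, the missing query projection saves d_model² weights per layer, which is the source of the parameter reduction discussed above.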

Business Value

Fewer parameters mean smaller models, faster training, and lower inference costs, and could enable deployment on resource-constrained devices, making LLMs more accessible and economical.