📄 Abstract
The Query, Key, Value weight triplet is a building block of current attention
mechanisms in state-of-the-art LLMs. We theoretically investigate whether this
triplet can be reduced, proving under simplifying assumptions that the Query
weights are redundant, thereby reducing the number of non-embedding/lm-head
parameters by over 8%. We validate the theory on full-complexity GPT-3 small
architectures (with layer normalization, skip connections, and weight decay)
trained from scratch, demonstrating that the reduced model achieves comparable
validation loss to standard baselines. These findings motivate the
investigation of the Query weight redundancy at scale.
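A rough way to see why a separate Query matrix can be redundant (an intuition sketch only; the paper's actual proof proceeds under its own simplifying assumptions for the full decoder-only architecture) is that the attention logits depend on the Query and Key weights only through their product:

$$
\frac{(XW_Q)(XW_K)^{\top}}{\sqrt{d}}
= \frac{X\,(W_Q W_K^{\top})\,X^{\top}}{\sqrt{d}}
= \frac{X\,(X\widetilde{W}_K)^{\top}}{\sqrt{d}},
\qquad \widetilde{W}_K := W_K W_Q^{\top},
$$

so the same logits can be produced with queries taken directly from the input $X$ and a single reparameterized Key matrix. Note that in the multi-head setting this naive folding does not by itself save parameters; the paper's contribution is to show, under its stated assumptions, that the Query weights can be dropped without loss, which is what yields the over 8% reduction in non-embedding/LM-head parameters.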
Authors (2)
Marko Karbevski
Antonij Mijoski
Submitted
October 27, 2025
Key Contributions
This paper theoretically proves, under simplifying assumptions, that the Query weights in the Query, Key, Value (QKV) triplet of the attention mechanism in decoder-only transformers are redundant. Empirical validation on GPT-3 small models trained from scratch confirms that removing these weights yields validation loss comparable to standard baselines, suggesting the potential to reduce non-embedding/LM-head parameters by over 8%. A query-free attention sketch is given below.
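As a concrete illustration, here is a minimal sketch (not the authors' implementation, and not necessarily the paper's exact reduced architecture) of a causal self-attention head trained without a separate Query matrix: queries are taken directly from the input, and only Key and Value projections are learned. The class and dimension names (QueryFreeAttentionHead, d_model, d_head) are illustrative assumptions.

```python
# Minimal sketch, not the authors' code: a causal self-attention head with no
# learned Query matrix. Queries are the raw inputs; only the Key and Value
# projections are trained. All names and dimensions here are illustrative.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class QueryFreeAttentionHead(nn.Module):
    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        # The usual W_Q @ W_K^T product is absorbed into a single key matrix,
        # so the key projection maps back to d_model to match the raw query.
        self.key = nn.Linear(d_model, d_model, bias=False)
        self.value = nn.Linear(d_model, d_head, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q = x                                  # identity "projection": no W_Q
        k = self.key(x)                        # (batch, seq_len, d_model)
        v = self.value(x)                      # (batch, seq_len, d_head)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        # Causal mask, since the paper targets decoder-only transformers.
        t = x.size(1)
        causal = torch.triu(
            torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1
        )
        scores = scores.masked_fill(causal, float("-inf"))
        return F.softmax(scores, dim=-1) @ v   # (batch, seq_len, d_head)


if __name__ == "__main__":
    head = QueryFreeAttentionHead(d_model=64, d_head=16)
    out = head(torch.randn(2, 10, 64))
    print(out.shape)  # torch.Size([2, 10, 16])
```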
Business Value
Reducing model parameters yields smaller models, faster training, and lower inference costs, and can enable deployment on resource-constrained devices, making LLMs more accessible and economical.