
Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning

📄 Abstract

In this technical report, we present the Ring-linear model series, specifically Ring-mini-linear-2.0 and Ring-flash-linear-2.0. Ring-mini-linear-2.0 comprises 16B parameters with 957M activated, while Ring-flash-linear-2.0 contains 104B parameters with 6.1B activated. Both models adopt a hybrid architecture that effectively integrates linear attention and softmax attention, significantly reducing I/O and computational overhead in long-context inference scenarios. Compared to a 32B-parameter dense model, this series reduces inference cost to 1/10, and compared to the original Ring series, the cost is reduced by over 50%. Furthermore, through systematic exploration of the ratio between the different attention mechanisms in the hybrid architecture, we have identified the currently optimal model structure. Additionally, by leveraging our self-developed high-performance FP8 operator library, linghe, overall training efficiency has been improved by 50%. Benefiting from the close alignment between training- and inference-engine operators, the models can undergo long-term, stable, and highly efficient optimization during the reinforcement learning phase, consistently maintaining SOTA performance across multiple challenging complex reasoning benchmarks.
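
To make the hybrid design concrete, below is a minimal sketch (not the authors' implementation) of a decoder stack that interleaves O(L) linear-attention blocks with standard O(L²) softmax-attention blocks at a configurable ratio. The layer sizes, the elu+1 feature map, and the one-softmax-layer-in-four default ratio are illustrative assumptions, not values taken from the report.

```python
# Sketch only: hybrid stack mixing linear and softmax attention at a fixed ratio.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalLinearAttention(nn.Module):
    """O(L) attention: positive feature maps + cumulative sums over the sequence."""

    def __init__(self, d_model: int):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, L, D)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = F.elu(q) + 1, F.elu(k) + 1  # elu+1 keeps features positive
        # Running sums of k_i ⊗ v_i and k_i give causal attention without an L×L matrix.
        kv = torch.cumsum(k.unsqueeze(-1) * v.unsqueeze(-2), dim=1)  # (B, L, D, D)
        z = torch.cumsum(k, dim=1)                                    # (B, L, D)
        num = torch.einsum("bld,bldm->blm", q, kv)
        den = (q * z).sum(-1, keepdim=True).clamp_min(1e-6)
        return self.out(num / den)


class CausalSoftmaxAttention(nn.Module):
    """Standard O(L^2) softmax attention with a causal mask."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        L = x.size(1)
        mask = torch.triu(torch.ones(L, L, dtype=torch.bool, device=x.device), 1)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out


class HybridBlock(nn.Module):
    def __init__(self, d_model: int, use_softmax: bool):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mix = CausalSoftmaxAttention(d_model) if use_softmax else CausalLinearAttention(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mix(self.norm(x))  # pre-norm residual block


class HybridStack(nn.Module):
    """Every `softmax_every`-th block uses softmax attention; the rest are linear."""

    def __init__(self, d_model: int = 64, n_layers: int = 8, softmax_every: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(
            HybridBlock(d_model, use_softmax=((i + 1) % softmax_every == 0))
            for i in range(n_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for blk in self.blocks:
            x = blk(x)
        return x


if __name__ == "__main__":
    x = torch.randn(2, 128, 64)    # (batch, seq_len, d_model)
    print(HybridStack()(x).shape)  # torch.Size([2, 128, 64])
```

Because most layers are linear attention, the per-token state they carry is constant in sequence length, so only the occasional softmax layers contribute the quadratic-growth KV cache that dominates long-context inference cost.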
Authors (28)
Ling Team
Bin Han
Caizhi Tang
Chen Liang
Donghao Zhang
Fan Yuan
+22 more
Submitted
October 22, 2025
arXiv Category
cs.LG
arXiv PDF

Key Contributions

Introduces the Ring-linear model series (Ring-mini-linear-2.0, Ring-flash-linear-2.0), a hybrid architecture combining linear and softmax attention that significantly reduces I/O and computational overhead for long-context inference. The report also details a systematic exploration of the ratio between the two attention mechanisms and a self-developed FP8 operator library (linghe) that improves overall training efficiency by 50%.
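
As a rough illustration of what an FP8 operator library builds on, the sketch below simulates per-tensor E4M3 scaling around a matrix multiply in PyTorch. This is not linghe's API; the scaling scheme and helper names are assumptions chosen for clarity, and the example needs PyTorch >= 2.1 for the float8_e4m3fn dtype.

```python
# Illustrative per-tensor FP8 (E4M3) scaling around a GEMM; not linghe's API.
import torch

E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn


def quantize_fp8(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Scale a tensor into the FP8 range, cast, and return (fp8_tensor, scale)."""
    scale = x.abs().max().clamp_min(1e-12) / E4M3_MAX
    return (x / scale).to(torch.float8_e4m3fn), scale


def fp8_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Quantize both operands to FP8, then multiply in higher precision.

    A real FP8 kernel keeps the operands in FP8 on-chip and accumulates in
    FP32; here we dequantize first so the sketch runs on any device.
    """
    a8, sa = quantize_fp8(a)
    b8, sb = quantize_fp8(b)
    return (a8.to(torch.float32) @ b8.to(torch.float32)) * (sa * sb)


if __name__ == "__main__":
    a, b = torch.randn(64, 128), torch.randn(128, 32)
    err = (fp8_matmul(a, b) - a @ b).abs().max()
    print(f"max abs error vs FP32 matmul: {err:.4f}")  # error reflects FP8 rounding
```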

Business Value

Enables the deployment of LLMs for tasks requiring long-context understanding (e.g., document summarization, extended dialogue) at a fraction of the current computational cost, making advanced AI more accessible.