Abstract
In this technical report, we present the Ring-linear model series, specifically including Ring-mini-linear-2.0 and Ring-flash-linear-2.0. Ring-mini-linear-2.0 comprises 16B total parameters with 957M activated, while Ring-flash-linear-2.0 contains 104B total parameters with 6.1B activated. Both models adopt a hybrid architecture that effectively integrates linear attention and softmax attention, significantly reducing I/O and computational overhead in long-context inference scenarios. Compared to a 32B dense model, this series cuts inference cost to 1/10, and compared to the original Ring series the cost is reduced by over 50%. Furthermore, through systematic exploration of the ratio between the different attention mechanisms in the hybrid architecture, we have identified the currently optimal model structure. Additionally, by leveraging our self-developed high-performance FP8 operator library, linghe, overall training efficiency is improved by 50%. Benefiting from the high alignment between training- and inference-engine operators, the models can undergo long-term, stable, and highly efficient optimization during the reinforcement learning phase, consistently maintaining SOTA performance across multiple challenging complex reasoning benchmarks.
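
To make the hybrid design concrete, the sketch below interleaves O(n) kernel-based linear attention with standard O(n^2) softmax attention at a fixed layer ratio. This is a minimal illustration under assumptions, not the Ring-linear implementation: the module names, the elu(x)+1 feature map, and the softmax_every ratio are all illustrative; the report's actual layer designs and optimal ratio are not reproduced here.

    # Minimal sketch of a hybrid attention stack (assumes PyTorch >= 2.0
    # for F.scaled_dot_product_attention). All names are illustrative.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def linear_attention(q, k, v, eps=1e-6):
        """O(n) attention via the elu(x)+1 feature-map kernel trick.
        q, k, v: (batch, heads, seq, dim); no softmax over seq."""
        q, k = F.elu(q) + 1, F.elu(k) + 1
        kv = torch.einsum("bhnd,bhne->bhde", k, v)              # sum_n k_n v_n^T
        z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)
        return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)

    class AttentionBlock(nn.Module):
        def __init__(self, dim, heads, use_softmax):
            super().__init__()
            self.heads, self.use_softmax = heads, use_softmax
            self.qkv = nn.Linear(dim, 3 * dim)
            self.proj = nn.Linear(dim, dim)

        def forward(self, x):
            b, n, d = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            shape = (b, n, self.heads, d // self.heads)
            q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))
            if self.use_softmax:
                out = F.scaled_dot_product_attention(q, k, v)   # O(n^2) softmax
            else:
                out = linear_attention(q, k, v)                 # O(n) linear
            return x + self.proj(out.transpose(1, 2).reshape(b, n, d))

    class HybridStack(nn.Module):
        """Interleave linear- and softmax-attention layers: with
        softmax_every=4, 1 of every 4 layers uses full softmax attention."""
        def __init__(self, dim=512, heads=8, depth=8, softmax_every=4):
            super().__init__()
            self.layers = nn.ModuleList(
                AttentionBlock(dim, heads,
                               use_softmax=(i % softmax_every == softmax_every - 1))
                for i in range(depth)
            )

        def forward(self, x):
            for layer in self.layers:
                x = layer(x)
            return x

Because the linear-attention layers carry no n-by-n score matrix, their KV state is constant in sequence length, which is the source of the I/O and compute savings the abstract claims for long contexts; the occasional softmax layer restores full pairwise interaction.
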
Authors (28)
Ling Team
Bin Han
Caizhi Tang
Chen Liang
Donghao Zhang
Fan Yuan
+22 more
Submitted
October 22, 2025
Key Contributions
Introduces the Ring-linear model series (Ring-mini-linear-2.0, Ring-flash-linear-2.0) with a hybrid architecture combining linear and softmax attention to significantly reduce I/O and computational overhead for long-context inference. The work also details a systematic exploration of the ratio between the two attention mechanisms and a self-developed FP8 operator library (linghe) that improves training efficiency by 50%.
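
This page does not expose linghe's API, so as a rough, hypothetical illustration of the per-tensor scaling that FP8 training kernels generally depend on, here is a minimal PyTorch sketch (assumes PyTorch >= 2.1 for the torch.float8_e4m3fn dtype):

    # Illustrative only: per-tensor FP8 (E4M3) quantize/dequantize.
    # This is NOT the linghe API; it sketches the scaling idea that
    # FP8 operator libraries build their fast GEMM kernels around.
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

    def quantize_fp8(x: torch.Tensor):
        """Scale x into the representable E4M3 range, then cast to FP8."""
        scale = FP8_MAX / x.abs().max().clamp(min=1e-12)
        x_fp8 = (x * scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return x_fp8, scale

    def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor):
        return x_fp8.to(torch.float32) / scale

    x = torch.randn(4, 4)
    x_fp8, s = quantize_fp8(x)
    print((x - dequantize_fp8(x_fp8, s)).abs().max())  # small quantization error

The efficiency gain the report attributes to linghe comes from running matrix multiplies in this 8-bit format on hardware FP8 units; the sketch above only shows the surrounding scale bookkeeping, not the fused kernels themselves.
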
Business Value
Enables the deployment of LLMs for tasks requiring long-context understanding (e.g., document summarization, extended dialogue) at a fraction of the current computational cost, making advanced AI more accessible.