Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model

📄 Abstract

We present Ring-1T, the first open-source, state-of-the-art thinking model at trillion-parameter scale. It has 1 trillion total parameters and activates approximately 50 billion per token. Training at this scale introduces unprecedented challenges, including train-inference misalignment, inefficiencies in rollout processing, and bottlenecks in the RL system. To address these, we pioneer three interconnected innovations: (1) IcePop stabilizes RL training via token-level discrepancy masking and clipping, resolving instability caused by training-inference mismatch; (2) C3PO++ improves resource utilization for long rollouts under a token budget by dynamically partitioning them, substantially improving time efficiency; and (3) ASystem, a high-performance RL framework designed to overcome the systemic bottlenecks that impede trillion-parameter model training. Ring-1T delivers breakthrough results across critical benchmarks: 93.4 on AIME-2025, 86.72 on HMMT-2025, 2088 on CodeForces, and 55.94 on ARC-AGI-v1. Notably, it attains a silver-medal-level result on IMO-2025, underscoring its exceptional reasoning capabilities. By releasing the complete 1T-parameter MoE model, we give the research community direct access to cutting-edge reasoning capabilities. This contribution marks a significant milestone in democratizing large-scale reasoning intelligence and establishes a new baseline for open-source model performance.
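
The core of IcePop is easy to illustrate. Below is a minimal, hypothetical PyTorch sketch of token-level discrepancy masking and clipping as the abstract describes it: per-token probability ratios between the training and inference engines are masked to zero when they drift outside a trust band and clipped otherwise. The function names, thresholds, and loss form are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumed PyTorch; hypothetical names and thresholds) of
# token-level discrepancy masking and clipping in the spirit of IcePop.
import torch

def icepop_weights(train_logprobs: torch.Tensor,
                   infer_logprobs: torch.Tensor,
                   low: float = 0.5,
                   high: float = 2.0) -> torch.Tensor:
    """Per-token weights: 0 where the train/infer probability ratio leaves
    [low, high] (masking), the clipped ratio everywhere else (clipping)."""
    ratio = torch.exp(train_logprobs - infer_logprobs)   # p_train / p_infer per token
    inside = ((ratio >= low) & (ratio <= high)).float()  # 1 inside the trust band
    return torch.clamp(ratio, low, high) * inside

def masked_pg_loss(train_logprobs, infer_logprobs, advantages):
    """Policy-gradient loss in which badly mismatched tokens contribute nothing."""
    weights = icepop_weights(train_logprobs, infer_logprobs).detach()
    return -(weights * advantages * train_logprobs).mean()
```

Detaching the weights keeps the mismatch correction out of the gradient path, so only the training engine's log-probabilities carry gradient; mismatched tokens are simply silenced rather than allowed to destabilize the update.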
Authors (104)
Ling Team: Anqi Shen, Baihui Li, Bin Hu, Bin Jing, Cai Chen, and 98 more
Submitted: October 21, 2025
arXiv Category: cs.CL

Key Contributions

This paper introduces Ring-1T, the first open-source trillion-scale thinking model, and three key innovations: IcePop for stabilizing RL training via token-level discrepancy masking and clipping; C3PO++ for improving resource utilization in long rollouts through dynamic partitioning; and ASystem, a high-performance RL framework to overcome systemic bottlenecks in training trillion-parameter models.
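
To make the C3PO++ idea concrete, here is a hedged Python sketch of dynamic rollout partitioning under a per-iteration token budget: each step spends at most a fixed number of generated tokens, returns finished rollouts for training, and pauses unfinished ones so the next iteration resumes them instead of letting long generations stall the batch. All names (`Rollout`, `rollout_step`, `generate_one_token`) are hypothetical, not the paper's API.

```python
# Hedged sketch (all names hypothetical) of dynamic rollout partitioning
# under a per-iteration token budget, in the spirit of C3PO++.
from dataclasses import dataclass, field

@dataclass
class Rollout:
    prompt_id: int
    tokens: list = field(default_factory=list)
    done: bool = False

def rollout_step(active, token_budget, generate_one_token):
    """Spend at most token_budget new tokens across the active rollouts.
    Returns (finished, paused): finished rollouts go to training now;
    paused ones resume from their saved state in the next iteration."""
    finished, paused, spent = [], [], 0
    for r in active:
        while not r.done and spent < token_budget:
            r.done = generate_one_token(r)  # appends one token; True on EOS
            spent += 1
        (finished if r.done else paused).append(r)
    return finished, paused

# Toy usage: a stand-in decoder that "finishes" a rollout after 5 tokens.
def toy_decode(r):
    r.tokens.append(0)
    return len(r.tokens) >= 5

active = [Rollout(i) for i in range(4)]
while active:
    done, active = rollout_step(active, token_budget=7, generate_one_token=toy_decode)
    # `done` would feed the trainer here; `active` carries over.
```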

Business Value

Open release of a trillion-parameter reasoning model gives practitioners direct access to state-of-the-art mathematical, coding, and general reasoning capabilities without the cost of training such a model themselves, lowering the barrier to building applications that require complex multi-step problem-solving.
