
Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model

Ling Team, Anqi Shen, Baihui Li, Bin Hu, Bin Jing, Cai Chen, Chao Huang, Chao Zhang, Chaokun Yang, Cheng Lin, Chengyao Wen, Congqi Li, Deng Zhao, Dingbo Yuan, Donghai You, Fagui Mao, Fanzhuang Meng, Feng Xu, Guojie Li, Guowei Wang, Hao Dai, Haonan Zheng

2025-10-22


Summary

This paper introduces Ring-1T, a new artificial intelligence model that's incredibly large: it has one trillion parameters, which are the adjustable values the model uses to learn, although only about 50 billion of them are active for each token it processes. It's designed to be a 'thinking' model, meaning it's good at reasoning and problem-solving, and importantly, it's being released publicly for anyone to use and build upon.

What's the problem?

Building and training a model with a trillion parameters is really hard. There were three main issues: first, the model can behave slightly differently during training than during inference (when it actually generates answers), and that mismatch destabilizes learning. Second, generating the long reasoning traces used to improve the model (called 'rollouts') takes a lot of computing power and time. Finally, the systems used to train these massive models had bottlenecks that slowed everything down.

What's the solution?

The researchers came up with three connected solutions. 'IcePop' fixes the training-versus-inference mismatch by masking out tokens where the two disagree too much and clipping the rest, which keeps learning stable. 'C3PO++' makes rollouts more efficient by dynamically splitting long ones across iterations so they fit within a token budget, instead of letting a few very long responses hold everything up. And 'ASystem' is a new high-performance reinforcement-learning framework designed to remove the bottlenecks in the training process, allowing the trillion-parameter model to learn effectively.
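To make the IcePop idea concrete, here is a minimal sketch (in PyTorch) of what token-level discrepancy masking and clipping could look like. The function name, thresholds, and loss form are illustrative assumptions, not the paper's actual implementation; the paper only states that discrepancies between the training and inference engines are masked and clipped at the token level.

```python
import torch

def icepop_loss(train_logprobs, infer_logprobs, advantages,
                clip_low=0.8, clip_high=1.25, mask_threshold=4.0):
    """Hypothetical sketch of token-level discrepancy masking and clipping.

    train_logprobs / infer_logprobs: log-probabilities of the sampled tokens
    under the training engine and the inference (rollout) engine, shape [T].
    advantages: per-token advantage estimates, shape [T].
    All threshold values are illustrative, not the paper's.
    """
    # Per-token probability ratio between the two engines.
    ratio = torch.exp(train_logprobs - infer_logprobs)

    # Masking: drop tokens whose train/inference discrepancy is too large
    # to be trusted for a gradient update.
    keep = ((ratio < mask_threshold) & (ratio > 1.0 / mask_threshold)).float()

    # Clipping: bound the remaining ratios so a few mismatched tokens
    # cannot dominate the policy-gradient update.
    clipped = torch.clamp(ratio, clip_low, clip_high)

    # Simplified policy-gradient objective over the kept tokens only.
    per_token = -(clipped * advantages) * keep
    return per_token.sum() / keep.sum().clamp(min=1.0)
```

The key design choice is that badly mismatched tokens are removed from the gradient entirely rather than merely down-weighted, so a divergence between the two engines cannot silently corrupt the update.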

Why it matters?

Ring-1T achieves top results on several challenging tests of reasoning and problem-solving, even earning a silver medal-level score on the 2025 International Mathematical Olympiad, a notoriously difficult math competition. By making this powerful model openly available, the researchers are helping to advance AI research and make advanced reasoning capabilities more accessible to everyone, setting a new standard for open-source AI performance.

Abstract

We present Ring-1T, the first open-source, state-of-the-art thinking model with trillion-scale parameters. It features 1 trillion total parameters and activates approximately 50 billion per token. Training such models at a trillion-parameter scale introduces unprecedented challenges, including train-inference misalignment, inefficiencies in rollout processing, and bottlenecks in the RL system. To address these, we pioneer three interconnected innovations: (1) IcePop stabilizes RL training via token-level discrepancy masking and clipping, resolving instability from training-inference mismatches; (2) C3PO++ improves resource utilization for long rollouts under a token budget by dynamically partitioning them, thereby achieving high time efficiency; and (3) ASystem, a high-performance RL framework designed to overcome the systemic bottlenecks that impede trillion-parameter model training. Ring-1T delivers breakthrough results across critical benchmarks: 93.4 on AIME-2025, 86.72 on HMMT-2025, 2088 on CodeForces, and 55.94 on ARC-AGI-v1. Notably, it attains a silver medal-level result on the IMO-2025, underscoring its exceptional reasoning capabilities. By releasing the complete 1T-parameter MoE model, we provide the research community with direct access to cutting-edge reasoning capabilities. This contribution marks a significant milestone in democratizing large-scale reasoning intelligence and establishes a new baseline for open-source model performance.
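As a rough illustration of the C3PO++ idea of dynamically partitioning long rollouts under a token budget, the sketch below shows one way a scheduler could split generation across training iterations: each iteration spends at most a fixed number of tokens, finished rollouts go to training, and unfinished ones are resumed later instead of stalling the step. The class, field names, and `generate_chunk` helper are hypothetical and stand in for whatever the real system uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Rollout:
    """A partially generated response; field names are hypothetical."""
    prompt_id: int
    generated_tokens: List[int] = field(default_factory=list)
    finished: bool = False

def schedule_iteration(active: List[Rollout], token_budget: int,
                       generate_chunk: Callable[[Rollout, int], int]
                       ) -> Tuple[List[Rollout], List[Rollout]]:
    """Spend up to `token_budget` tokens extending the active rollouts.

    `generate_chunk(rollout, max_tokens)` is an assumed helper that extends a
    rollout by at most `max_tokens` tokens, marks it finished if it completes,
    and returns the number of tokens it actually generated.
    """
    ready, carried_over = [], []
    remaining = token_budget
    for rollout in active:
        if remaining <= 0:
            # Out of budget: resume this rollout in the next iteration.
            carried_over.append(rollout)
            continue
        used = generate_chunk(rollout, remaining)
        remaining -= used
        (ready if rollout.finished else carried_over).append(rollout)
    return ready, carried_over
```

This kind of budgeted, resumable scheduling is only a guess at the mechanism, but it captures the stated goal: long rollouts no longer force the whole training step to wait for the slowest response.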