AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy

Zihan Liu, Zhuolin Yang, Yang Chen, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

2025-06-17

Summary

This paper introduces AceReason-Nemotron 1.1, a new AI model that gets better at solving math and coding problems by combining two training methods: supervised fine-tuning (SFT), where the model learns from many examples with correct answers, and reinforcement learning (RL), where the model learns by trying different answers and improving based on feedback. This combination lets the model reason more clearly and solve harder problems.

What's the problem?

The problem is that while AI models can learn from examples, they often struggle with very complex math and coding problems because they do not explore enough different solutions or learn how to improve from their attempts. Using only supervised learning or only reinforcement learning isn't enough to build the strongest reasoning skills, and it's challenging to balance the two training methods well.

What's the solution?

The solution was to carefully combine supervised fine-tuning and reinforcement learning by first training a strong base model with lots of examples, then gradually teaching it to explore and improve its answers through reinforcement learning. The researchers also adjusted important settings like sampling temperature to find the best balance between exploring new solutions and focusing on the best ones. This creates a model that learns effectively from both human examples and trial-and-error practice.
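To make the "sampling temperature" idea concrete, here is a minimal, generic sketch of temperature-scaled sampling (not the paper's actual training code; the function name and example logits are hypothetical). A lower temperature makes the model pick its top choice more often (focusing), while a higher temperature spreads probability across alternatives (exploring):

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Sample an index from raw logits after temperature scaling.

    Lower temperature sharpens the distribution toward the best option
    (exploitation); higher temperature flattens it (exploration).
    """
    # Scale logits, then apply a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw from the resulting categorical distribution.
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1

# Hypothetical logits: with a very low temperature, sampling almost
# always returns the index of the largest logit (index 0 here).
logits = [2.0, 1.0, 0.5]
idx = sample_with_temperature(logits, temperature=0.05)
```

Tuning this single knob during RL training is one way to trade off trying new solution paths against refining the ones the model already finds promising.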

Why it matters?

This matters because it makes AI models much better at understanding and solving difficult math and coding tasks, which are key skills for many scientific and technical areas. Having AI that reasons better can help people solve complicated problems faster and with more accuracy, advancing technology and supporting learning and development in many fields.

Abstract

Combining supervised fine-tuning and reinforcement learning enhances reasoning models, especially when optimizing sampling temperature and leveraging strong initial fine-tuning, as demonstrated by the improved AceReason-Nemotron-1.1 model.