Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search

Maohao Shen, Guangtao Zeng, Zhenting Qi, Zhang-Wei Hong, Zhenfang Chen, Wei Lu, Gregory Wornell, Subhro Das, David Cox, Chuang Gan

2025-02-05

Summary

This paper introduces Satori, a new AI model that uses a technique called Chain-of-Action-Thought (COAT) to improve how large language models (LLMs) think through and solve problems. Satori trains itself to reflect on its own reasoning and explore new strategies without outside help, improving its performance on math problems and its ability to generalize to other domains.

What's the problem?

LLMs are good at reasoning, but on complex tasks they often depend on extra help from external systems, such as a separate verifier model that guides their search at inference time. This reliance on outside guidance limits their ability to solve problems independently and to improve their own reasoning skills.

What's the solution?

The researchers created Satori, which uses COAT reasoning to teach the model to work through problems step by step, reflect on its mistakes, and try new solution strategies. They trained Satori in two stages: a small-scale format-tuning stage that teaches the model the COAT reasoning format, followed by a large-scale reinforcement learning stage in which the model improves itself over time. This makes Satori capable of solving complex problems on its own.
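To make the idea concrete, here is a toy sketch of the second (self-improvement) stage, assuming COAT works by letting the model emit special meta-action tokens that decide whether to continue reasoning, reflect, or explore an alternative. The token names, the reward rule, and the scoring scheme below are all illustrative stand-ins, not the paper's actual implementation or tokens; a real system would update the weights of a 7B LLM with a policy-gradient method rather than a score table.

```python
import math
import random

# Illustrative meta-action tokens (hypothetical names, not from the paper's code)
META_ACTIONS = ["<|continue|>", "<|reflect|>", "<|explore|>"]

def sample_action(scores, rng):
    """Softmax-sample one meta-action from the current learned scores."""
    weights = [math.exp(scores[a]) for a in META_ACTIONS]
    return rng.choices(META_ACTIONS, weights=weights, k=1)[0]

def rollout_reward(actions):
    """Toy outcome-level reward: this fake 'problem' is only solved when
    the trajectory reflects at least once, so reflection should be learned."""
    return 1.0 if "<|reflect|>" in actions else -1.0

def self_improve(episodes=500, steps=3, lr=0.1, seed=0):
    """REINFORCE-style self-improvement loop: sample a trajectory of
    meta-actions, score it with an outcome reward, and reinforce the
    actions that appeared in rewarded trajectories."""
    rng = random.Random(seed)
    # Stage 1 (format tuning) would initialize these preferences instead of zeros
    scores = {a: 0.0 for a in META_ACTIONS}
    for _ in range(episodes):
        actions = [sample_action(scores, rng) for _ in range(steps)]
        reward = rollout_reward(actions)
        for a in actions:
            scores[a] += lr * reward  # push up (down) actions from good (bad) rollouts
    return scores

scores = self_improve()
```

After training, the `<|reflect|>` action ends up with the highest score, mirroring how reinforcement learning can teach the model to use self-reflection when it leads to correct final answers.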

Why it matters?

Satori is important because it shows that LLMs can learn to reason more effectively without relying on external systems. This advancement could lead to smarter, more independent AI models that excel at solving a wide variety of tasks and help push the boundaries of AI research.

Abstract

Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. This typically involves extensive sampling at inference time guided by an external LLM verifier, resulting in a two-player system. Despite external guidance, the effectiveness of this system demonstrates the potential of a single LLM to tackle complex tasks. Thus, we pose a new research problem: Can we internalize the searching capabilities to fundamentally enhance the reasoning abilities of a single LLM? This work explores an orthogonal direction focusing on post-training LLMs for autoregressive searching (i.e., an extended reasoning process with self-reflection and self-exploration of new strategies). To achieve this, we propose the Chain-of-Action-Thought (COAT) reasoning and a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning. Our approach results in Satori, a 7B LLM trained on open-source models and data. Extensive empirical evaluations demonstrate that Satori achieves state-of-the-art performance on mathematical reasoning benchmarks while exhibiting strong generalization to out-of-domain tasks. Code, data, and models will be fully open-sourced.