SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning
Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, Wee Sun Lee, Natasha Jaques
2025-07-01
Summary
This paper introduces SPIRAL, a training method in which language models learn by playing zero-sum games against themselves: one player wins only if the other loses, which pressures the model to keep improving its reasoning.
What's the problem?
Most language model training relies on human-curated questions and reward signals, which cap how far reasoning skills can scale because producing them takes extensive expert effort and domain-specific data.
What's the solution?
SPIRAL uses self-play in zero-sum games to generate an endless stream of challenges automatically, forcing models to keep improving by adapting to ever-stronger versions of themselves. This training encourages reasoning patterns, such as breaking problems into steps and weighing alternative possibilities, that transfer to real-world problem solving.
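The core self-play loop can be illustrated with a toy sketch. The code below is a hypothetical minimal example, not the paper's implementation: it uses matching pennies (a simple zero-sum game) instead of the games SPIRAL trains on, a one-parameter policy instead of a language model, and a plain REINFORCE update. The function names (`policy_prob`, `play`, `spiral_sketch`) and the `sync_every` schedule are illustrative assumptions.

```python
import math
import random

random.seed(0)

def policy_prob(logit):
    # Probability of playing "heads" under a one-parameter sigmoid policy.
    return 1.0 / (1.0 + math.exp(-logit))

def play(logit_a, logit_b):
    # One round of matching pennies: A gets +1 if the coins match,
    # -1 otherwise. Zero-sum by construction: B's reward is -A's.
    a = random.random() < policy_prob(logit_a)
    b = random.random() < policy_prob(logit_b)
    return (1.0 if a == b else -1.0), a

def spiral_sketch(steps=2000, lr=0.1, sync_every=50):
    # Learner trains against a frozen copy of itself (the opponent).
    learner, opponent = 0.0, 0.0
    for t in range(steps):
        reward, action = play(learner, opponent)
        # REINFORCE on the single logit:
        # d/d(logit) log pi(action) = (1 - p) for heads, -p for tails.
        p = policy_prob(learner)
        grad_logp = (1.0 - p) if action else -p
        learner += lr * reward * grad_logp
        # Periodically promote the learner to be the new opponent, so the
        # opposition strengthens as the learner improves -- the self-play
        # curriculum that drives continuous adaptation.
        if (t + 1) % sync_every == 0:
            opponent = learner
    return policy_prob(learner)
```

Because the opponent is always a recent copy of the learner, any exploitable bias in the policy is eventually punished, pushing play toward a mixed strategy; the same pressure, at scale, is what incentivizes the model to reason about its opponent's possible moves.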
Why it matters?
This matters because it shows that AI can improve its reasoning without human supervision, making it possible to train stronger models with less effort and potentially benefiting applications such as math problem solving and decision making.
Abstract
Self-play in zero-sum games under the SPIRAL framework enhances reasoning capabilities in language models through continuous adaptation, with the learned skills transferring to other tasks.