
SPICE: Self-Play In Corpus Environments Improves Reasoning

Bo Liu, Chuanyang Jin, Seungone Kim, Weizhe Yuan, Wenting Zhao, Ilia Kulikov, Xian Li, Sainbayar Sukhbaatar, Jack Lanchantin, Jason Weston

2025-10-29

Summary

This paper introduces a new way for AI systems to get better at reasoning by essentially playing against themselves, but with a twist: they use real-world information from a large collection of documents to make the challenges more realistic and constantly evolving.

What's the problem?

AI models that learn through self-play – improving by competing against themselves – often hit a plateau because they start generating tasks that are either too easy or disconnected from the real world. This caps how much their reasoning skills can actually improve. Existing methods struggle to consistently produce progressively harder problems for the AI to solve, leading to stalled progress.

What's the solution?

The researchers developed a framework called SPICE. It has two parts: a 'Challenger' and a 'Reasoner'. The Challenger searches through a huge database of documents to create difficult reasoning problems. The Reasoner then tries to solve those problems. As the Reasoner gets better, the Challenger creates even harder problems based on the documents, pushing the Reasoner to continually improve. This process uses real-world information ('corpus grounding') to keep the challenges relevant and complex.
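The Challenger-Reasoner loop described above can be illustrated with a toy simulation. This is only a sketch under simplified assumptions – the class names, the numeric success model, and the skill-update rule are all invented for illustration, not taken from the paper, and the real SPICE mines actual reasoning tasks from documents and trains a single model in both roles with reinforcement learning:

```python
import random

random.seed(0)

# Toy stand-in for SPICE's large document corpus (illustrative only).
CORPUS = [f"document-{i}" for i in range(100)]


class Challenger:
    """Poses tasks whose difficulty tracks the Reasoner's current skill."""

    def pose_task(self, reasoner_skill: float) -> float:
        _doc = random.choice(CORPUS)  # corpus grounding (unused in this toy)
        # Target the frontier: difficulty hovers just above current skill.
        return reasoner_skill + random.uniform(-0.05, 0.15)


class Reasoner:
    """Attempts tasks; solving frontier tasks raises its skill."""

    def __init__(self) -> None:
        self.skill = 0.5

    def attempt(self, difficulty: float) -> bool:
        # Toy success model: easier tasks are solved more often.
        p_success = 1.0 / (1.0 + 2.0 ** (10 * (difficulty - self.skill)))
        solved = random.random() < p_success
        if solved:
            # Stand-in for an RL update: success at the frontier improves skill.
            self.skill += 0.01 * max(0.0, difficulty - self.skill + 0.1)
        return solved


challenger = Challenger()
reasoner = Reasoner()
initial = reasoner.skill
for _ in range(500):
    task_difficulty = challenger.pose_task(reasoner.skill)
    reasoner.attempt(task_difficulty)

print(f"skill: {initial:.2f} -> {reasoner.skill:.2f}")
```

The key dynamic this sketch captures is the automatic curriculum: because the Challenger always targets difficulties near the Reasoner's current skill, the tasks get harder as the Reasoner improves, so neither side stagnates.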

Why it matters?

This research is important because it shows a way to build AI systems that can continuously learn and improve their reasoning abilities without needing constant human intervention. By grounding the self-play in real-world data, SPICE avoids the limitations of previous methods and achieves significant gains in both mathematical and general reasoning tasks, suggesting a path towards more robust and adaptable AI.

Abstract

Self-improving systems require environmental interaction for continuous adaptation. We introduce SPICE (Self-Play In Corpus Environments), a reinforcement learning framework where a single model acts in two roles: a Challenger that mines documents from a large corpus to generate diverse reasoning tasks, and a Reasoner that solves them. Through adversarial dynamics, the Challenger creates an automatic curriculum at the frontier of the Reasoner's capability, while corpus grounding provides the rich, near-inexhaustible external signal necessary for sustained improvement. Unlike existing ungrounded self-play methods that offer more limited benefits, SPICE achieves consistent gains across mathematical (+8.9%) and general reasoning (+9.8%) benchmarks on multiple model families. Our analysis reveals how document grounding is a key ingredient in SPICE to continuously generate its own increasingly challenging goals and achieve them, enabling sustained self-improvement.