s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto

2025-02-03

Summary

This paper introduces a new way to make AI language models better at solving complex problems, especially math questions. The researchers created a system called s1 that uses extra thinking time during testing to improve the AI's performance.

What's the problem?

Current AI language models are good at many tasks, but they sometimes struggle with difficult reasoning problems, like complex math questions. Recently, OpenAI's o1 model showed that giving AI more time to think during testing can help, but the exact method behind it wasn't shared publicly. This made it hard for other researchers to replicate and improve on this approach.

What's the solution?

The researchers developed s1, a simpler version of this test-time scaling idea. They did two main things: First, they created a small but carefully chosen set of 1,000 questions with detailed reasoning steps. They used these to train the AI. Second, they invented 'budget forcing', which controls how long the AI thinks about a problem. If the AI tries to answer too quickly, it's forced to think longer by adding 'Wait' to its response. This often helps the AI correct mistakes. They applied these techniques to an existing AI model called Qwen2.5-32B-Instruct.
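To make the budget-forcing idea concrete, here is a minimal sketch of how a decoding loop could enforce a thinking budget. The generate() helper, the <think>/</think> markers, and the rough token counting are assumptions for illustration only, not the authors' released implementation.

```python
# Minimal sketch of budget forcing, assuming a hypothetical generate(prompt, max_tokens, stop)
# helper that returns the model's continuation and stops either at the token budget
# or at the end-of-thinking marker. Not the authors' code.

def budget_forced_answer(generate, question, min_thinking_tokens=2000,
                         max_thinking_tokens=8000, end_think="</think>"):
    """Keep the model reasoning until a minimum token budget has been spent."""
    prompt = question + "\n<think>\n"
    thinking = ""
    spent = 0
    while spent < min_thinking_tokens:
        chunk = generate(prompt + thinking,
                         max_tokens=max_thinking_tokens - spent,
                         stop=end_think)
        if not chunk:
            break
        thinking += chunk
        spent += len(chunk.split())  # rough word-level count, for illustration only
        if spent < min_thinking_tokens:
            # The model tried to stop early: append "Wait" so it reconsiders its reasoning.
            thinking += "\nWait"
    # Once the budget is met (or the model can add nothing more), ask for the final answer.
    answer = generate(prompt + thinking + end_think + "\nFinal answer:",
                      max_tokens=512, stop=None)
    return thinking, answer
```

The same loop also supports the other direction of control: cutting max_thinking_tokens down forces the model to stop reasoning sooner, which is how a test-time compute budget can be dialed up or down.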

Why it matters?

This matters because it shows a way to make AI significantly better at complex reasoning tasks without needing to create entirely new, larger models. On difficult competition math tests (MATH and AIME24), s1 scored up to 27% higher than OpenAI's o1-preview. This approach could help make AI more capable of solving real-world problems that require careful thinking and reasoning. By making their model, data, and code open-source, the researchers are also helping the whole AI community build on and improve this work, potentially leading to even smarter AI systems in the future.

Abstract

Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1 exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1 with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1.
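The abstract's three selection criteria for s1K (difficulty, diversity, quality) could be applied with a simple filtering pass like the sketch below. The scoring helpers, topic labels, and per-topic cap are hypothetical placeholders for illustration, not the actual s1K curation pipeline.

```python
# Sketch of an s1K-style selection pass: keep questions that are high quality,
# difficult, and spread across many topics. is_high_quality, is_difficult, and
# topic_of are assumed helper functions supplied by the caller.

from collections import defaultdict

def select_s1k_style(candidates, is_high_quality, is_difficult, topic_of,
                     per_topic_cap=25, target=1000):
    """candidates: list of (question, reasoning_trace) pairs."""
    selected = []
    per_topic = defaultdict(int)
    for question, trace in candidates:
        if not is_high_quality(question, trace):   # quality filter
            continue
        if not is_difficult(question):             # difficulty filter
            continue
        topic = topic_of(question)
        if per_topic[topic] >= per_topic_cap:      # diversity: cap examples per topic
            continue
        per_topic[topic] += 1
        selected.append((question, trace))
        if len(selected) >= target:                # stop at the 1,000-example budget
            break
    return selected
```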