
START: Self-taught Reasoner with Tools

Chengpeng Li, Mingfeng Xue, Zhenru Zhang, Jiaxi Yang, Beichen Zhang, Xiang Wang, Bowen Yu, Binyuan Hui, Junyang Lin, Dayiheng Liu

2025-03-07


Summary

This paper introduces START, a new AI system that improves how advanced reasoning models solve complex problems by using external tools such as code execution.

What's the problem?

Large reasoning AI models, like those used to solve science or math problems, often hallucinate or waste effort because they rely solely on their internal thinking process and never use external tools to check or improve their answers.

What's the solution?

The researchers created START, which inserts hints during generation (e.g., "Wait, maybe using Python here is a good idea.") to encourage the model to use tools like Python for calculations or debugging while it solves a problem; they call this Hint-infer. They also developed Hint Rejection Sampling Fine-Tuning (Hint-RFT), which scores, filters, and refines the tool-using reasoning traces produced via Hint-infer and then fine-tunes the model on them. This approach makes the model more accurate and more efficient on difficult tasks.
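A minimal sketch of the Hint-infer idea, assuming a plain text-completion callable `generate` (prompt in, continuation out). The hint string is taken from the paper's abstract, but `run_python`, `extract_code`, the `[Tool output]` marker, and the round-based loop are illustrative assumptions, not the paper's exact implementation.

```python
import re
import subprocess
import sys

# Hint text quoted from the paper's abstract.
HINT = "Wait, maybe using Python here is a good idea."


def run_python(code: str, timeout: int = 10) -> str:
    """Run a generated snippet in a subprocess and capture its output."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout + result.stderr


def extract_code(text: str) -> str | None:
    """Pull the last fenced ```python block out of a model generation."""
    blocks = re.findall(r"```python\n(.*?)```", text, re.DOTALL)
    return blocks[-1] if blocks else None


def hint_infer(generate, prompt: str, max_rounds: int = 3) -> str:
    """Decode a reasoning trace while nudging the model toward tool use.

    `generate` stands in for any LLM completion call (str -> str).
    Each round: if the model wrote code, execute it and feed the result
    back; otherwise append the hint and let the model keep thinking.
    """
    trace = prompt
    for _ in range(max_rounds):
        continuation = generate(trace)
        trace += continuation
        code = extract_code(continuation)
        if code is not None:
            # Feeding back real execution output lets the model check and
            # debug itself instead of guessing intermediate results.
            trace += "\n[Tool output]\n" + run_python(code)
        else:
            trace += "\n" + HINT
    return trace
```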

Why it matters?

This matters because it makes AI better at solving complex problems in areas like science, math, and coding. By combining reasoning with external tools, START reduces mistakes and improves efficiency, bringing us closer to AI systems that can handle real-world challenges at a high level of accuracy.

Abstract

Large reasoning models (LRMs) like OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable capabilities in complex reasoning tasks through the utilization of long chain-of-thought (CoT). However, these models often suffer from hallucinations and inefficiencies due to their reliance solely on internal reasoning processes. In this paper, we introduce START (Self-Taught Reasoner with Tools), a novel tool-integrated long-CoT reasoning LLM that significantly enhances reasoning capabilities by leveraging external tools. Through code execution, START is capable of performing complex computations, self-checking, exploring diverse methods, and self-debugging, thereby addressing the limitations of LRMs. The core innovation of START lies in its self-learning framework, which comprises two key techniques: 1) Hint-infer: we demonstrate that inserting artificially designed hints (e.g., "Wait, maybe using Python here is a good idea.") during the inference process of an LRM effectively stimulates its ability to utilize external tools without the need for any demonstration data. Hint-infer can also serve as a simple and effective sequential test-time scaling method; 2) Hint Rejection Sampling Fine-Tuning (Hint-RFT): Hint-RFT combines Hint-infer and RFT by scoring, filtering, and modifying the reasoning trajectories with tool invocation generated by an LRM via Hint-infer, followed by fine-tuning the LRM. Through this framework, we have fine-tuned the QwQ-32B model to achieve START. On PhD-level science QA (GPQA), competition-level math benchmarks (AMC23, AIME24, AIME25), and the competition-level code benchmark (LiveCodeBench), START achieves accuracy rates of 63.6%, 95.0%, 66.7%, 47.1%, and 47.3%, respectively. It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B and the proprietary model o1-Preview.
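Hint-RFT then turns those hinted rollouts into training data by rejection sampling. The sketch below, under the same assumptions as the Hint-infer sketch above, keeps only trajectories that actually invoked the tool and reached the reference answer; the paper's full pipeline also scores and modifies trajectories before fine-tuning, which is omitted here. `score_trajectory`, the dataset schema, and `n_samples` are hypothetical names and choices, not the paper's API.

```python
def score_trajectory(trace: str, reference_answer: str) -> bool:
    """Toy rejection filter: accept traces that invoked the tool and
    whose final line contains the reference answer. The paper's real
    scoring additionally filters for quality and edits trajectories."""
    used_tool = "[Tool output]" in trace
    answered = reference_answer in trace.strip().splitlines()[-1]
    return used_tool and answered


def build_hint_rft_dataset(problems, sample, n_samples: int = 8):
    """Collect accepted Hint-infer rollouts as fine-tuning pairs.

    `problems` : list of {"prompt": str, "answer": str} dicts.
    `sample`   : callable prompt -> trace, e.g. the hint_infer sketch above.
    """
    dataset = []
    for problem in problems:
        for _ in range(n_samples):
            trace = sample(problem["prompt"])
            if score_trajectory(trace, problem["answer"]):
                dataset.append(
                    {"prompt": problem["prompt"], "completion": trace}
                )
    # A standard supervised fine-tuning pass over `dataset` would then
    # update the model, completing one Hint-RFT iteration.
    return dataset
```

The design point the abstract emphasizes is that correctness filtering happens after hint-driven tool use, so the resulting fine-tuning data teaches the model when to reach for the interpreter, not just what the final answer is.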