Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, Jiawei Han

2025-03-13

Summary

This paper introduces Search-R1, an AI system that trains language models to think, search the web, and refine their answers through trial and error, like a student learning to solve problems by looking up facts and checking their work.

What's the problem?

Current AI models either can’t search the web effectively while answering questions or need tons of pre-written examples to learn how to use search tools properly.

What's the solution?

Search-R1 uses a reward system where the AI practices searching and answering questions, getting better based on whether the final answer is correct, without needing pre-made training data.
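The "reward based on whether the final answer is correct" idea can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual code: the `<answer>...</answer>` tag format and the exact-match comparison are assumptions about how the rollout is scored.

```python
import re

def extract_answer(rollout: str) -> str:
    """Pull the final answer out of a rollout that wraps it in
    <answer>...</answer> tags (the tag format is an assumption)."""
    matches = re.findall(r"<answer>(.*?)</answer>", rollout, re.DOTALL)
    return matches[-1].strip() if matches else ""

def outcome_reward(rollout: str, gold: str) -> float:
    """Outcome-based reward: 1.0 if the extracted final answer
    matches the gold answer (case-insensitive exact match), else 0.0.
    No intermediate search or reasoning steps are graded."""
    return 1.0 if extract_answer(rollout).lower() == gold.lower() else 0.0
```

Because only the final answer is scored, no hand-labeled search trajectories are needed; the model discovers useful query strategies on its own.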

Why does it matter?

This helps AI assistants give more accurate, up-to-date answers for homework help, research, or news by learning to search smarter, not harder.

Abstract

Efficiently acquiring external knowledge and up-to-date information is essential for effective reasoning and text generation in large language models (LLMs). Retrieval augmentation and tool-use training approaches where a search engine is treated as a tool lack complex multi-turn retrieval flexibility or require large-scale supervised data. Prompting advanced LLMs with reasoning capabilities during inference to use search engines is not optimal, since the LLM does not learn how to optimally interact with the search engine. This paper introduces Search-R1, an extension of the DeepSeek-R1 model where the LLM learns -- solely through reinforcement learning (RL) -- to autonomously generate (multiple) search queries during step-by-step reasoning with real-time retrieval. Search-R1 optimizes LLM rollouts with multi-turn search interactions, leveraging retrieved token masking for stable RL training and a simple outcome-based reward function. Experiments on seven question-answering datasets show that Search-R1 improves performance by 26% (Qwen2.5-7B), 21% (Qwen2.5-3B), and 10% (LLaMA3.2-3B) over SOTA baselines. This paper further provides empirical insights into RL optimization methods, LLM choices, and response length dynamics in retrieval-augmented reasoning. The code and model checkpoints are available at https://github.com/PeterGriffinJin/Search-R1.
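The "retrieved token masking" the abstract mentions means that text copied in from the search engine is excluded from the RL loss, since the model did not generate it. A minimal sketch, assuming retrieved passages are delimited by `<information>...</information>` tokens (the delimiter choice is an assumption):

```python
def retrieval_loss_mask(tokens):
    """Per-token loss mask for a rollout given as a token list:
    0.0 inside <information>...</information> spans (retrieved text,
    delimiters included), 1.0 for model-generated tokens. Masked
    positions contribute nothing to the policy-gradient loss."""
    mask, inside = [], False
    for tok in tokens:
        if tok == "<information>":
            inside = True
        mask.append(0.0 if inside else 1.0)
        if tok == "</information>":
            inside = False
    return mask
```

Masking these tokens keeps gradient updates focused on the model's own decisions, which the paper credits with stabilizing multi-turn RL training.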