
Step-DeepResearch Technical Report

Chen Hu, Haikuo Du, Heng Wang, Lin Lin, Mingrui Chen, Peng Liu, Ruihang Miao, Tianchi Yue, Wang You, Wei Ji, Wei Yuan, Wenjin Deng, Xiaojian Yuan, Xiaoyun Zhang, Xiangyu Liu, Xikai Liu, Yanming Xu, Yicheng Cao, Yifei Zhang, Yongyao Wang, Yubo Shu, Yurong Zhang

2025-12-24


Summary

This paper focuses on improving how well large language models (LLMs) can perform complex, real-world research tasks, moving beyond just answering simple questions to actually *doing* research like a person would.

What's the problem?

Current ways of testing LLMs on research tasks, like the BrowseComp benchmark, aren't very realistic. Real research needs a model to understand what's being asked, plan out a long series of steps, and check information from multiple sources to make sure it's correct. Existing tests don't really push models to do all of these things well, and evaluation is especially thin for research in languages other than English.

What's the solution?

The researchers created a new system called Step-DeepResearch. It works by breaking research down into small, manageable steps and training the model to do each step really well. The model is trained progressively, starting with basic skills (agentic mid-training) and building up to more complex ones (supervised fine-tuning, then reinforcement learning), with a 'checklist'-style judge making sure the model doesn't miss important details. To test this, the researchers also created ADR-Bench, a new, more challenging benchmark specifically for research in Chinese.
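To make the step-by-step idea concrete, here is a minimal sketch of what a decomposed research loop could look like. Every name in it (ResearchStep, plan_steps, execute_step) is a hypothetical illustration rather than the paper's actual implementation, and the planner and tool calls are stubbed out where a real agent would call the model and its search tools.

```python
from dataclasses import dataclass, field

@dataclass
class ResearchStep:
    """One small, checkable unit of a larger research task."""
    description: str
    done: bool = False
    evidence: list[str] = field(default_factory=list)

def plan_steps(question: str) -> list[ResearchStep]:
    # Hypothetical planner: a real system would ask the LLM to
    # decompose the question; fixed steps stand in for that here.
    return [
        ResearchStep("Clarify what the question is really asking"),
        ResearchStep("Search multiple sources for candidate answers"),
        ResearchStep("Cross-check key facts across at least two sources"),
        ResearchStep("Draft a report that cites the verified evidence"),
    ]

def execute_step(step: ResearchStep) -> None:
    # Placeholder for tool use (web search, page reading, etc.).
    step.evidence.append(f"evidence for: {step.description}")
    step.done = True

def run_research(question: str) -> str:
    steps = plan_steps(question)
    for step in steps:
        execute_step(step)
    # Checklist-like self-check: don't finish while any step is open.
    unfinished = [s.description for s in steps if not s.done]
    if unfinished:
        raise RuntimeError(f"Unfinished steps: {unfinished}")
    return "\n".join(e for s in steps for e in s.evidence)

print(run_research("What training recipe does Step-DeepResearch use?"))
```

The design point is that each step is small enough to verify on its own, so failures surface early instead of only showing up in the final report.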

Why does it matter?

This work shows that you don't necessarily need a huge, expensive model to do good research. By training a medium-sized model (32B parameters) with these techniques, the team achieved performance comparable to the best closed-source systems, like OpenAI's and Google's Deep Research agents, at a much lower cost, scoring 61.4% on the Scale AI Research Rubrics. This means more people and organizations could realistically use LLMs for serious research tasks.

Abstract

As LLMs shift toward autonomous agents, Deep Research has emerged as a pivotal metric. However, existing academic benchmarks like BrowseComp often fail to meet real-world demands for open-ended research, which requires robust skills in intent recognition, long-horizon decision-making, and cross-source verification. To address this, we introduce Step-DeepResearch, a cost-effective, end-to-end agent. We propose a Data Synthesis Strategy Based on Atomic Capabilities to reinforce planning and report writing, combined with a progressive training path from agentic mid-training to SFT and RL. Enhanced by a Checklist-style Judger, this approach significantly improves robustness. Furthermore, to bridge the evaluation gap in the Chinese domain, we establish ADR-Bench for realistic deep research scenarios. Experimental results show that Step-DeepResearch (32B) scores 61.4% on Scale AI Research Rubrics. On ADR-Bench, it significantly outperforms comparable models and rivals SOTA closed-source models like OpenAI and Gemini DeepResearch. These findings prove that refined training enables medium-sized models to achieve expert-level capabilities at industry-leading cost-efficiency.
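As a rough illustration of the 'Checklist-style Judger' idea, the sketch below scores a report as the fraction of checklist items it satisfies. The rubric items and the substring matching are invented stand-ins: the paper's actual judger presumably asks a grader model whether each item is met rather than matching strings.

```python
def checklist_judge(report: str, checklist: list[str]) -> float:
    """Return the fraction of checklist items the report satisfies.

    Substring matching stands in for what would really be a call to
    a grader model evaluating each checklist item against the report.
    """
    hits = [item for item in checklist if item.lower() in report.lower()]
    return len(hits) / len(checklist)

# Hypothetical rubric for a report on the paper's training recipe.
rubric = ["32B", "benchmark", "cost"]

report = "The 32B model was evaluated on a new benchmark at low cost."
print(f"checklist score: {checklist_judge(report, rubric):.2f}")  # 1.00
```

During reinforcement learning, a score like this could serve as the reward signal, penalizing reports that omit required details.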