Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis

Zhisong Qiu, Shuofei Qiao, Kewei Xu, Yuqi Zhu, Lun Du, Ningyu Zhang, Huajun Chen

2026-04-28

Summary

This paper focuses on improving how we teach AI agents, specifically large language models, to analyze data and solve dynamic, multi-step problems, like working with spreadsheets or databases. It builds on a technique called 'process reward modeling', which gives the agent feedback at each step of the problem-solving process rather than only at the end.
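To make that distinction concrete, here is a minimal sketch (in Python, with hypothetical function and type names; the paper does not prescribe this interface) of the difference between an outcome reward, which scores only the final answer, and a process reward, which scores every intermediate step of a trajectory:

```python
from typing import Callable, List

# A trajectory is the sequence of steps an agent took, e.g. the code
# cells it executed while analyzing a table, in order.
Trajectory = List[str]

def outcome_reward(final_answer: str, gold_answer: str) -> float:
    """Outcome-level feedback: one score for the whole trajectory."""
    return 1.0 if final_answer == gold_answer else 0.0

def process_rewards(
    trajectory: Trajectory,
    score_step: Callable[[Trajectory], float],
) -> List[float]:
    """Process-level feedback: one score per intermediate step.

    `score_step` stands in for a learned process reward model (PRM)
    that judges each step in the context of the steps before it.
    """
    return [score_step(trajectory[: i + 1]) for i in range(len(trajectory))]
```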

What's the problem?

Existing reward models, trained on general-domain data, aren't very good at guiding data analysis agents. They often miss 'silent errors': logical flaws in calculations that produce wrong results without causing a crash. They also incorrectly punish the agent for trying different approaches to find the right solution, mistaking necessary trial-and-error exploration for failure. This makes it hard for the AI to learn effectively when dealing with dynamic data.

What's the solution?

The researchers created a new reward model called DataPRM, designed specifically for data analysis. It works in two key ways. First, it acts as an active verifier, interacting with the data environment itself to probe intermediate results and uncover hidden silent errors. Second, it uses a more nuanced, three-way reward scheme that separates mistakes the agent can recover from (like a failed exploratory attempt) from fatal ones, so the agent is encouraged to learn from its attempts rather than being punished for exploring. The authors also built a training dataset of over 8,000 examples for this model, using a pipeline that generates diverse trajectories and labels each step with the help of domain knowledge.
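As a rough illustration only (the labels, checks, and probe interface below are hypothetical assumptions, not DataPRM's actual implementation), the ternary idea can be sketched as a step judge that returns one of three verdicts instead of a binary pass/fail, and that is allowed to run probes against the live data environment before deciding:

```python
from enum import Enum

class StepVerdict(Enum):
    CORRECT = 1      # step advances the analysis correctly
    RECOVERABLE = 0  # grounding error the agent can still fix, e.g. a
                     # failed exploratory query; not punished as fatal
    FATAL = -1       # irrecoverable mistake, e.g. a silent logic error
                     # that corrupts every downstream result

def judge_step(step_code: str, run_probe) -> StepVerdict:
    """Hypothetical environment-aware step judge.

    `run_probe` executes a small check against the live environment
    (e.g. re-running the step and inspecting the intermediate table),
    which is how a silent error can be caught even though the step
    raised no interpreter exception.
    """
    probe = run_probe(step_code)  # inspect intermediate execution state
    if probe.raised_exception:
        # A crashed step is visible to the agent and usually fixable.
        return StepVerdict.RECOVERABLE
    if not probe.result_is_consistent:
        # Ran "successfully" but produced a wrong intermediate result:
        # a silent error, treated here as fatal for this trajectory.
        return StepVerdict.FATAL
    return StepVerdict.CORRECT
```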

Why it matters?

This work is important because it significantly improves the performance of AI agents on complex data analysis tasks. DataPRM makes these agents more accurate and efficient despite its relatively small size of only 4 billion parameters. This advancement has the potential to make AI more useful in fields that rely heavily on data interpretation, like scientific research, business analytics, and financial modeling.

Abstract

Process Reward Models (PRMs) have achieved remarkable success in augmenting the reasoning capabilities of Large Language Models (LLMs) within static domains such as mathematics. However, their potential in dynamic data analysis tasks remains underexplored. In this work, we first present an empirical study revealing that general-domain PRMs struggle to supervise data analysis agents. Specifically, they fail to detect silent errors, logical flaws that yield incorrect results without triggering interpreter exceptions, and erroneously penalize exploratory actions, mistaking necessary trial-and-error exploration for grounding failures. To bridge this gap, we introduce DataPRM, a novel environment-aware generative process reward model that (1) can serve as an active verifier, autonomously interacting with the environment to probe intermediate execution states and uncover silent errors, and (2) employs a reflection-aware ternary reward strategy that distinguishes between correctable grounding errors and irrecoverable mistakes. We design a scalable pipeline to construct over 8K high-quality training instances for DataPRM via diversity-driven trajectory generation and knowledge-augmented step-level annotation. Experimental results demonstrate that DataPRM improves downstream policy LLMs by 7.21% on ScienceAgentBench and 11.28% on DABStep using Best-of-N inference. Notably, with only 4B parameters, DataPRM outperforms strong baselines, and exhibits robust generalizability across diverse Test-Time Scaling strategies. Furthermore, integrating DataPRM into Reinforcement Learning yields substantial gains over outcome-reward baselines, achieving 78.73% on DABench and 64.84% on TableBench, validating the effectiveness of process reward supervision. Code is available at https://github.com/zjunlp/DataMind.
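For readers unfamiliar with the Best-of-N inference mentioned in the abstract, the idea works roughly as follows (a minimal sketch with hypothetical names; how DataPRM actually aggregates its step scores is not specified here): sample N candidate trajectories from the policy model, score each with the process reward model, and keep the highest-scoring one.

```python
from typing import Callable, List

def best_of_n(
    sample_trajectory: Callable[[], List[str]],       # policy LLM sampler
    score_steps: Callable[[List[str]], List[float]],  # PRM step scorer
    n: int = 8,
) -> List[str]:
    """Return the sampled trajectory whose process rewards are best.

    Aggregating per-step scores by their mean is one common choice in
    the PRM literature; taking the min or the product are others.
    """
    candidates = [sample_trajectory() for _ in range(n)]

    def aggregate(traj: List[str]) -> float:
        scores = score_steps(traj)
        return sum(scores) / len(scores) if scores else float("-inf")

    return max(candidates, key=aggregate)
```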