Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?
Yancheng He, Shilong Li, Jiaheng Liu, Weixun Wang, Xingyuan Bu, Ge Zhang, Zhongyuan Peng, Zhaoxiang Zhang, Wenbo Su, Bo Zheng
2025-02-27
Summary
This paper introduces a new way to test how well AI language models can spot mistakes in long, step-by-step reasoning processes, known as Chain-of-Thought (CoT) reasoning.
What's the problem?
As AI models get better at explaining their thinking through long chains of reasoning, it's important to know whether they can also spot errors in these explanations. However, there wasn't a good way to test this ability before.
What's the solution?
The researchers created DeltaBench, a collection of long reasoning traces produced by different AI models on tasks such as math, coding, and general reasoning. They used it to analyze how effectively different AI models produce these explanations and how good other models are at finding mistakes in them. They tested both process reward models, which judge the quality of individual reasoning steps, and critic models, which critique other AIs' work.
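One common way to score this kind of error detection is to compare the reasoning steps a model flags as erroneous against human-annotated error steps using precision, recall, and F1. The sketch below illustrates that idea; the function name and inputs are illustrative assumptions, not the paper's actual evaluation code.

```python
# Hypothetical sketch of scoring step-level error detection.
# `eval_error_detection` and its inputs are illustrative, not DeltaBench's API.

def eval_error_detection(predicted_error_steps, annotated_error_steps):
    """Compare step indices a critic model flags as erroneous against
    human-annotated error steps; return precision, recall, and F1."""
    pred = set(predicted_error_steps)
    gold = set(annotated_error_steps)
    true_pos = len(pred & gold)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: the critic flags steps 3 and 7; annotators marked steps 3 and 5.
scores = eval_error_detection([3, 7], [3, 5])
print(scores)  # precision=0.5, recall=0.5, f1=0.5
```

A set-based comparison like this rewards models for pinpointing exactly where a long CoT goes wrong, rather than just labeling the whole answer as correct or incorrect.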
Why it matters?
This matters because as AI gets more complex and is used for important decisions, we need to make sure it can not only explain its thinking but also catch its own mistakes. DeltaBench helps developers understand their AI models better, which could lead to more reliable and trustworthy AI systems in the future.
Abstract
Recently, o1-like models have drawn significant attention; these models produce long Chain-of-Thought (CoT) reasoning steps to improve the reasoning abilities of existing Large Language Models (LLMs). In this paper, to understand the quality of these long CoTs and measure the critique abilities of existing LLMs on them, we introduce DeltaBench, which includes long CoTs generated by different o1-like models (e.g., QwQ, DeepSeek-R1) for different reasoning tasks (e.g., Math, Code, General Reasoning), to measure the ability to detect errors in long CoT reasoning. Based on DeltaBench, we first perform a fine-grained analysis of the generated long CoTs to assess the effectiveness and efficiency of different o1-like models. Then, we conduct extensive evaluations of existing process reward models (PRMs) and critic models on detecting the errors in each annotated process, which aims to investigate the boundaries and limitations of existing PRMs and critic models. Finally, we hope that DeltaBench can guide developers to better understand the long CoT reasoning abilities of their models.