RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques
Zhengyang Tang, Ziniu Li, Zhenyang Xiao, Tian Ding, Ruoyu Sun, Benyou Wang, Dayiheng Liu, Fei Huang, Tianyu Liu, Bowen Yu, Junyang Lin
2025-01-27

Summary
This paper introduces RealCritic, a new way to test how well AI language models can critique and improve their own work or the work of others. It's like creating a special exam that checks whether an AI can not only solve problems but also explain what went wrong and fix its mistakes.
What's the problem?
Current ways of testing an AI's ability to critique are too simple: they judge the critique on its own, like grading a student's comments on a test without ever checking whether those comments help anyone fix their mistakes. This doesn't show how well the AI can really improve its own answers or help other AIs get better.
What's the solution?
The researchers created RealCritic, a more demanding, closed-loop test: instead of just asking the AI to point out mistakes, it checks whether the AI's critique actually leads to a corrected answer that is right. RealCritic also covers different critique settings: self-critique (the AI reviews its own work), cross-critique (it reviews another model's work), and iterative critique (it keeps refining an answer over multiple rounds). The benchmark is built from eight challenging reasoning tasks.
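A minimal sketch of what this closed-loop setup might look like in Python. The `generate` and `is_correct` functions, the prompt wording, and the data fields are assumptions for illustration only, not the paper's actual implementation (which lives at https://github.com/tangzhy/RealCritic).

```python
# Illustrative sketch of closed-loop critique evaluation, under assumptions:
# `generate(prompt)` calls some LLM and returns its text, and
# `is_correct(answer, reference)` checks a final answer against the reference.
# Both are hypothetical stand-ins, not RealCritic's real code.

def critique_and_correct(generate, problem, solution):
    """Ask the model to critique a proposed solution and produce a corrected answer."""
    prompt = (
        f"Problem:\n{problem}\n\n"
        f"Proposed solution:\n{solution}\n\n"
        "Critique this solution, point out any flaws, then give a corrected final answer."
    )
    return generate(prompt)  # text containing the critique followed by a corrected answer

def closed_loop_accuracy(generate, examples, is_correct, rounds=1):
    """Score critiques by whether the *corrected* answers turn out to be right.

    rounds > 1 models iterative critique: each corrected answer is fed back
    as the next solution to be critiqued.
    """
    solved = 0
    for ex in examples:  # each example: a problem, an initial solution, a reference answer
        solution = ex["initial_solution"]
        for _ in range(rounds):
            solution = critique_and_correct(generate, ex["problem"], solution)
        solved += is_correct(solution, ex["reference_answer"])
    return solved / len(examples)
```

The key design choice this illustrates is that the critique text is never scored on its own: only the downstream corrected answer is checked, so a critique counts as good only if it actually helps.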
Why it matters?
This matters because as AIs get smarter, we need better ways to test them. RealCritic shows that even when models look equally good at solving problems directly, advanced reasoning models such as o1-mini are much better at critiquing and improving answers, while classical models can even make their own answers worse when critiquing themselves. This could help build AIs that are not just good at giving answers, but also at explaining their thinking and fixing their mistakes. In the future, this could lead to AIs that are better at helping humans learn and solve complex problems.
Abstract
Critiques are important for enhancing the performance of Large Language Models (LLMs), enabling both self-improvement and constructive feedback for others by identifying flaws and suggesting improvements. However, evaluating the critique capabilities of LLMs presents a significant challenge due to the open-ended nature of the task. In this work, we introduce a new benchmark designed to assess the critique capabilities of LLMs. Unlike existing benchmarks, which typically function in an open-loop fashion, our approach employs a closed-loop methodology that evaluates the quality of corrections generated from critiques. Moreover, the benchmark incorporates features such as self-critique, cross-critique, and iterative critique, which are crucial for distinguishing the abilities of advanced reasoning models from more classical ones. We implement this benchmark using eight challenging reasoning tasks. We have several interesting findings. First, despite demonstrating comparable performance in direct chain-of-thought generation, classical LLMs significantly lag behind the advanced reasoning-based model o1-mini across all critique scenarios. Second, in self-critique and iterative critique settings, classical LLMs may even underperform relative to their baseline capabilities. We hope that this benchmark will serve as a valuable resource to guide future advancements. The code and data are available at https://github.com/tangzhy/RealCritic.
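As a companion to the sketch above, here is a small illustration of how the three critique settings mentioned in the abstract might differ; the function names are assumed for illustration and are not the benchmark's real interface. The only difference between settings is where the solution being critiqued comes from, and how many rounds are run.

```python
# Illustrative only (assumed names, not RealCritic's API): the settings differ
# in who produced the solution that gets critiqued.
def make_initial_solution(mode, problem, critic_model, other_model):
    if mode == "self":   # self-critique: the critic reviews its own answer
        return critic_model(problem)
    if mode == "cross":  # cross-critique: the critic reviews another model's answer
        return other_model(problem)
    raise ValueError(f"unknown mode: {mode}")

# Iterative critique then simply runs the critique-and-correct step for several
# rounds on whichever initial solution the chosen setting provides.
```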