RefineBench: Evaluating Refinement Capability of Language Models via Checklists

Young-Jun Lee, Seungone Kim, Byung-Kwan Lee, Minkyeong Moon, Yechan Hwang, Jong Myoung Kim, Graham Neubig, Sean Welleck, Ho-Jin Choi

2025-12-01

Summary

This paper investigates whether language models (LMs) can improve their own answers, a skill that is becoming increasingly important as people often ask for revisions and clarifications. It explores how well LMs refine responses both when given direct feedback and when trying to improve on their own.

What's the problem?

Currently, testing how well LMs refine their answers has mostly focused on simple tasks like math problems where the right answer is clear. However, real-world questions are often open-ended and people give feedback in different ways. The researchers wanted to see how well LMs can actually handle this more realistic type of refinement, especially without being explicitly told what's wrong with their initial response.

What's the solution?

The researchers created a new benchmark called RefineBench, which includes 1,000 challenging questions across 11 domains, each paired with a checklist of criteria used to score responses. They then tested several LMs, including frontier models like Gemini 2.5 Pro and GPT-5, in two ways: first, by giving them natural language feedback (guided refinement), and second, by letting them try to improve their answers on their own over multiple attempts (self-refinement). They carefully tracked how much the models improved with each iteration.
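The two evaluation modes described above can be sketched as a simple loop around a model call and a checklist scorer. This is an illustrative assumption, not the paper's actual implementation: `call_model` is a trivial stub standing in for an LM API, and the keyword-matching scorer is a toy substitute for the benchmark's checklist-based evaluation.

```python
# Sketch of guided vs. self-refinement loops over a checklist
# (hypothetical names; not RefineBench's real code).

def call_model(prompt: str) -> str:
    # Stub: a real implementation would query a language model.
    # Here we just echo the last line of the prompt so the example runs.
    return prompt.splitlines()[-1]

def score_against_checklist(response: str, checklist: list[str]) -> float:
    # Toy scorer: fraction of checklist items mentioned in the response.
    hits = sum(1 for item in checklist if item.lower() in response.lower())
    return hits / len(checklist)

def refine(question: str, checklist: list[str],
           max_turns: int = 5, guided: bool = False) -> tuple[str, float]:
    response = call_model(question)
    for _ in range(max_turns):
        if score_against_checklist(response, checklist) == 1.0:
            break
        if guided:
            # Guided refinement: feedback names the missing checklist items.
            missing = [c for c in checklist
                       if c.lower() not in response.lower()]
            feedback = "Address the following points: " + "; ".join(missing)
        else:
            # Self-refinement: no hint about what is wrong.
            feedback = "Review your answer and improve it if needed."
        response = call_model(
            f"{question}\nPrevious answer: {response}\n{feedback}")
    return response, score_against_checklist(response, checklist)
```

With the echo stub, guided refinement reaches a perfect checklist score (the feedback surfaces the missing items), while self-refinement stays flat, loosely mirroring the paper's finding that targeted feedback helps far more than unguided retries.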

Why it matters?

The results showed that even the best LMs struggle to consistently improve their answers without guidance: Gemini 2.5 Pro gained only +1.8% across self-refinement iterations, and DeepSeek-R1 actually declined slightly. By contrast, with specific feedback, strong models reached near-perfect scores within five turns. This means that further research is needed to develop LMs that can truly learn from their mistakes and refine their responses independently, and RefineBench provides a good way to measure progress in this area.

Abstract

Can language models (LMs) self-refine their own responses? This question is increasingly relevant as a wide range of real-world user interactions involve refinement requests. However, prior studies have largely tested LMs' refinement abilities on verifiable tasks such as competition math or symbolic reasoning with simplified scaffolds, whereas users often pose open-ended queries and provide varying degrees of feedback on what they desire. The recent advent of reasoning models that exhibit self-reflection patterns in their chains-of-thought further motivates this question. To analyze this, we introduce RefineBench, a benchmark of 1,000 challenging problems across 11 domains paired with a checklist-based evaluation framework. We evaluate two refinement modes: (1) guided refinement, where an LM is provided natural language feedback, and (2) self-refinement, where LMs attempt to improve without guidance. In the self-refinement setting, even frontier LMs such as Gemini 2.5 Pro and GPT-5 achieve modest baseline scores of 31.3% and 29.1%, respectively, and most models fail to consistently improve across iterations (e.g., Gemini-2.5-Pro gains only +1.8%, while DeepSeek-R1 declines by -0.1%). By contrast, in guided refinement, both proprietary LMs and large open-weight LMs (>70B) can leverage targeted feedback to refine responses to near-perfect levels within five turns. These findings suggest that frontier LMs require breakthroughs to self-refine their incorrect responses, and that RefineBench provides a valuable testbed for tracking progress.