RevisEval: Improving LLM-as-a-Judge via Response-Adapted References
Qiyuan Zhang, Yufei Wang, Tiezheng YU, Yuxin Jiang, Chuhan Wu, Liangyou Li, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Fuyuan Lyu, Chen Ma
2024-10-09

Summary
This paper introduces RevisEval, a new method for evaluating the quality of text generated by large language models (LLMs) by using references that are specifically adapted to the responses being evaluated.
What's the problem?
While LLMs have become popular for judging the quality of text generation, they often do not match human evaluators in reliability. One major issue is that LLMs lack effective reference points (or 'guided oracles') to compare against when assessing generated text, leading to less accurate evaluations.
What's the solution?
RevisEval addresses this problem by creating 'response-adapted references.' Instead of comparing against a fixed, pre-written reference, the method uses an LLM to revise the response under evaluation, producing a new reference that stays closely relevant to that response. This revised text then serves as the benchmark for judging the quality of the original response, whether by an LLM judge or by classical metrics. The authors' experiments show that this approach yields more reliable evaluations than traditional reference-free and reference-based methods.
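A minimal sketch of this two-step pipeline is shown below. The prompts, the scoring rubric, and the `call_llm` helper are illustrative placeholders, not the paper's exact prompts or implementation; they only show how the revision step produces a response-adapted reference that then grounds the judging step.

```python
# Minimal sketch of the RevisEval idea (illustrative; the paper's actual
# prompts and rubric may differ). `call_llm` stands in for any
# instruction-following chat LLM client.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its reply."""
    raise NotImplementedError

def make_adapted_reference(instruction: str, response: str) -> str:
    """Step 1: revise the response itself into a higher-quality reference
    that remains closely relevant to the response being evaluated."""
    prompt = (
        f"Task instruction:\n{instruction}\n\n"
        f"Candidate response:\n{response}\n\n"
        "Revise the candidate response so that it fully satisfies the "
        "instruction while changing as little as possible. "
        "Return only the revised text."
    )
    return call_llm(prompt)

def judge_with_reference(instruction: str, response: str, reference: str) -> str:
    """Step 2: use the response-adapted reference to ground the LLM judge."""
    prompt = (
        f"Task instruction:\n{instruction}\n\n"
        f"Reference answer:\n{reference}\n\n"
        f"Response to evaluate:\n{response}\n\n"
        "Rate the response from 1 to 10 against the reference answer and "
        "briefly justify the score."
    )
    return call_llm(prompt)

def revis_eval(instruction: str, response: str) -> str:
    adapted_ref = make_adapted_reference(instruction, response)
    return judge_with_reference(instruction, response, adapted_ref)
```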
Why it matters?
This research is important because it improves how we assess AI-generated text, making it more aligned with human judgment. By enhancing the evaluation process, RevisEval can lead to better quality in text generation tasks, which is crucial for applications like content creation, chatbots, and other areas where understanding and generating human-like text is essential.
Abstract
With significant efforts in recent studies, LLM-as-a-Judge has become a cost-effective alternative to human evaluation for assessing text generation quality in a wide range of tasks. However, a reliability gap remains between LLM-as-a-Judge and human evaluation. One important reason is the lack of guided oracles in the evaluation process. Motivated by the role references pervasively play in classic text evaluation, we introduce RevisEval, a novel text generation evaluation paradigm via response-adapted references. RevisEval is driven by the key observation that an ideal reference should maintain the necessary relevance to the response to be evaluated. Specifically, RevisEval leverages the text revision capabilities of large language models (LLMs) to adaptively revise the response, then treats the revised text as the reference (response-adapted reference) for the subsequent evaluation. Extensive experiments demonstrate that RevisEval outperforms traditional reference-free and reference-based evaluation paradigms that use LLM-as-a-Judge across NLG tasks and open-ended instruction-following tasks. More importantly, our response-adapted references can further boost the classical text metrics, e.g., BLEU and BERTScore, compared to traditional references and even rival the LLM-as-a-Judge. A detailed analysis is also conducted to confirm RevisEval's effectiveness in bias reduction, the impact of inference cost, and reference relevance.
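The abstract notes that response-adapted references can also boost classical metrics such as BLEU and BERTScore. A minimal sketch of that usage follows, assuming the `sacrebleu` and `bert-score` packages and the hypothetical `make_adapted_reference` helper from the earlier sketch; it simply swaps the adapted reference in where a fixed reference would normally go.

```python
# Sketch: scoring a response against its response-adapted reference with
# classical metrics. Assumes `sacrebleu`, `bert-score`, and the illustrative
# make_adapted_reference helper defined in the previous sketch.
import sacrebleu
from bert_score import score as bert_score

def classic_metrics_with_adapted_ref(instruction: str, response: str) -> dict:
    # LLM revision step produces the response-adapted reference.
    adapted_ref = make_adapted_reference(instruction, response)

    # BLEU: sacrebleu takes a list of hypotheses and a list of reference streams.
    bleu = sacrebleu.corpus_bleu([response], [[adapted_ref]]).score

    # BERTScore: returns precision, recall, and F1 tensors; report F1.
    _, _, f1 = bert_score([response], [adapted_ref], lang="en")

    return {"bleu": bleu, "bertscore_f1": f1.item()}
```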