
Self-Taught Evaluators

Tianlu Wang, Ilia Kulikov, Olga Golovneva, Ping Yu, Weizhe Yuan, Jane Dwivedi-Yu, Richard Yuanzhe Pang, Maryam Fazel-Zarandi, Jason Weston, Xian Li

2024-08-06


Summary

This paper introduces Self-Taught Evaluators, a method that improves model-based evaluators without any human preference annotations by training on synthetic data instead.

What's the problem?

Evaluating AI models accurately is crucial for their development, but the standard approach relies on collecting large amounts of human preference judgments over model responses. This process is expensive and time-consuming, and as models improve, the older human feedback becomes stale and less relevant.

What's the solution?

The authors propose a method that lets evaluators improve themselves without any human input. Starting from unlabeled instructions, an iterative scheme generates contrasting model outputs: one preferred response and one deliberately worse response, so the preference label is known by construction. A large language model (LLM) is then trained to act as a judge, producing reasoning traces and final judgments over these pairs, and at each iteration the judge is retrained on its own correct predictions, leading to better evaluations over time (see the sketch below).
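Below is a minimal, hypothetical Python sketch of that loop. The primitives `llm_generate` (send a prompt to a model, get text back) and `fine_tune` (update the judge on new examples) are assumed placeholders, and the prompts and parsing are illustrative only, not the paper's actual implementation.

```python
# Hypothetical sketch of an iterative self-taught evaluator loop.
# `llm_generate(model, prompt)` and `fine_tune(model, examples)` are assumed
# placeholders for model inference and training, not real library calls.

def generate_contrasting_pair(model, instruction):
    """Create a (better, worse) response pair with a known preference.

    The worse response answers a perturbed version of the instruction,
    so no human preference labels are needed.
    """
    good = llm_generate(model, f"Respond to this instruction:\n{instruction}")
    perturbed = llm_generate(
        model, f"Write a similar but subtly different instruction:\n{instruction}"
    )
    bad = llm_generate(model, f"Respond to this instruction:\n{perturbed}")
    return good, bad


def judge(model, instruction, response_a, response_b):
    """Ask the current judge for a reasoning trace ending in a verdict, A or B."""
    prompt = (
        f"Instruction: {instruction}\n"
        f"Response A: {response_a}\nResponse B: {response_b}\n"
        "Reason step by step, then end with 'Verdict: A' or 'Verdict: B'."
    )
    trace = llm_generate(model, prompt)
    verdict = "A" if "Verdict: A" in trace else "B"
    return trace, verdict


def self_train_evaluator(model, unlabeled_instructions, iterations=3):
    for _ in range(iterations):
        examples = []
        for instruction in unlabeled_instructions:
            good, bad = generate_contrasting_pair(model, instruction)
            trace, verdict = judge(model, instruction, good, bad)
            # Keep only judgments that agree with the known preference
            # (response A is the better one by construction).
            if verdict == "A":
                examples.append({"instruction": instruction, "target": trace})
        # Fine-tune the judge on its own correct reasoning traces, then repeat.
        model = fine_tune(model, examples)
    return model
```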

Why it matters?

The Self-Taught Evaluator substantially boosts an existing LLM's evaluation ability without any labeled preference data, raising Llama3-70B-Instruct from 75.4 to 88.3 on RewardBench, outperforming commonly used LLM judges such as GPT-4 and matching top reward models trained on labeled examples. This approach saves time and annotation cost while keeping pace with rapidly improving models, making it a valuable tool for future AI development.

Abstract

Model-based evaluation is at the heart of successful model development -- as a reward model for training, and as a replacement for human evaluation. To train such evaluators, the standard approach is to collect a large amount of human preference judgments over model responses, which is costly and the data becomes stale as models improve. In this work, we present an approach that aims to improve evaluators without human annotations, using synthetic training data only. Starting from unlabeled instructions, our iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this training at each new iteration using the improved predictions. Without any labeled preference data, our Self-Taught Evaluator can improve a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench. This outperforms commonly used LLM judges such as GPT-4 and matches the performance of the top-performing reward models trained with labeled examples.
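On the "majority vote" figure mentioned in the abstract, here is a brief, hedged sketch of how such voting over multiple judge samples might look, reusing the hypothetical `judge` helper from the sketch above.

```python
from collections import Counter

def majority_vote_judgment(model, instruction, response_a, response_b, n_samples=5):
    """Sample the judge several times and return the most common verdict.

    `judge` is the illustrative placeholder defined earlier; sampling
    multiple reasoning traces and voting is a standard way to stabilize
    LLM-as-a-Judge verdicts.
    """
    verdicts = [
        judge(model, instruction, response_a, response_b)[1]
        for _ in range(n_samples)
    ]
    return Counter(verdicts).most_common(1)[0][0]
```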