On scalable oversight with weak LLMs judging strong LLMs
Zachary Kenton, Noah Y. Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, Rohin Shah
2024-07-08

Summary
This paper studies how to improve human oversight of advanced AI systems using a method called debate, in which two AI models argue for opposing answers to convince a judge, and compares it with other methods such as consultancy and direct question-answering.
What's the problem?
The main problem is that as AI systems become more powerful, it becomes harder for humans to oversee them effectively and to verify that their answers are correct. Approaches like having a single AI try to persuade the judge (consultancy) or letting the judge answer on its own can lead to mistakes, especially when the judge is less capable than the AI. This creates a need for better ways to evaluate and supervise advanced AI models.
What's the solution?
To tackle this issue, the authors compared different oversight methods: debate (where two AIs argue for opposing answers), consultancy (where one AI tries to convince a judge), and direct question-answering (where the judge answers without AI help). They found that debate generally works better than consultancy because the judge hears both sides of the argument, making it harder to be misled. In addition, when the AIs could choose which answer to argue for, judges were convinced by the wrong answer less often under debate than under consultancy. Stronger debater models also improved the accuracy of judges' decisions, though only modestly.
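The three protocols can be pictured as simple interaction loops between strong agent models and a weaker judge model. The sketch below is a hypothetical illustration only, not the authors' implementation: the `query_model` helper, the prompt wording, and the fixed number of turns are assumptions made for clarity.

```python
# Minimal sketch of the three oversight protocols, assuming a generic
# `query_model(model, prompt)` helper that stands in for any LLM API call.
# Prompt wording and turn counts are illustrative, not the paper's setup.

def query_model(model: str, prompt: str) -> str:
    """Placeholder for a call to an LLM; plug in your provider's API here."""
    raise NotImplementedError

def debate(question: str, answer_a: str, answer_b: str,
           debater: str, judge: str, turns: int = 2) -> str:
    """Two debaters argue opposing answers; a weaker judge picks A or B."""
    transcript = f"Question: {question}\nA: {answer_a}\nB: {answer_b}\n"
    for t in range(turns):
        for side, ans in (("A", answer_a), ("B", answer_b)):
            arg = query_model(
                debater,
                f"{transcript}\nYou argue that answer {side} ('{ans}') is "
                f"correct. Give your strongest argument for turn {t + 1}.")
            transcript += f"Debater {side}: {arg}\n"
    return query_model(
        judge, f"{transcript}\nWhich answer is correct? Reply 'A' or 'B'.")

def consultancy(question: str, answer_a: str, answer_b: str,
                assigned: str, consultant: str, judge: str,
                turns: int = 2) -> str:
    """One consultant argues for its assigned answer; the judge may ask
    follow-up questions before deciding."""
    transcript = f"Question: {question}\nA: {answer_a}\nB: {answer_b}\n"
    for _ in range(turns):
        pitch = query_model(
            consultant,
            f"{transcript}\nConvince the judge that answer {assigned} is correct.")
        transcript += f"Consultant: {pitch}\n"
        follow_up = query_model(
            judge, f"{transcript}\nAsk the consultant one clarifying question.")
        transcript += f"Judge: {follow_up}\n"
    return query_model(
        judge, f"{transcript}\nWhich answer is correct? Reply 'A' or 'B'.")

def direct_qa(question: str, answer_a: str, answer_b: str, judge: str) -> str:
    """Baseline: the judge answers on its own, with no agent assistance."""
    return query_model(
        judge,
        f"Question: {question}\nA: {answer_a}\nB: {answer_b}\n"
        "Which answer is correct? Reply 'A' or 'B'.")
```

In this framing, judge accuracy under each protocol is simply how often the judge's choice matches the correct answer over many questions, which is the quantity the paper compares across tasks.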
Why it matters?
This research is important because it provides insights into how we can effectively supervise powerful AI systems. By showing that debate can lead to better decision-making, it suggests new ways to ensure that AI provides accurate and reliable information. This has significant implications for fields where AI is used for critical tasks, such as healthcare, finance, and education, where making the right decisions is essential.
Abstract
Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI. In this paper we study debate, where two AIs compete to convince a judge; consultancy, where a single AI tries to convince a judge that asks questions; and compare to a baseline of direct question-answering, where the judge just answers outright without the AI. We use large language models (LLMs) as both AI agents and as stand-ins for human judges, taking the judge models to be weaker than agent models. We benchmark on a diverse range of asymmetries between judges and agents, extending previous work on a single extractive QA task with information asymmetry, to also include mathematics, coding, logic and multimodal reasoning asymmetries. We find that debate outperforms consultancy across all tasks when the consultant is randomly assigned to argue for the correct/incorrect answer. Comparing debate to direct question answering, the results depend on the type of task: in extractive QA tasks with information asymmetry debate outperforms direct question answering, but in other tasks without information asymmetry the results are mixed. Previous work assigned debaters/consultants an answer to argue for. When we allow them to instead choose which answer to argue for, we find judges are less frequently convinced by the wrong answer in debate than in consultancy. Further, we find that stronger debater models increase judge accuracy, though more modestly than in previous studies.