
Language Model Council: Benchmarking Foundation Models on Highly Subjective Tasks by Consensus

Justin Zhao, Flor Miriam Plaza-del-Arco, Amanda Cercas Curry

2024-06-14


Summary

This paper introduces the Language Model Council (LMC), a new way to evaluate Large Language Models (LLMs) on subjective tasks like emotional intelligence and creative writing. The LMC uses a democratic process where multiple models judge each other's responses to create fairer rankings.

What's the problem?

Evaluating LLMs is difficult for subjective tasks, where different judges can reasonably disagree about what makes a good response. Traditional benchmarks often rely on a single model to judge responses, which can produce biased or inconsistent results. This is especially problematic for tasks that require emotional understanding or creativity, where there may be no majority agreement to appeal to.

What's the solution?

To solve this, the authors built the LMC: a council of 20 recent LLMs that collectively formulates a test set, responds to it, and then evaluates the responses as a jury. In the paper, the council is applied to an open-ended emotional intelligence task: responding to interpersonal dilemmas. Because every member both takes the test and helps judge it, no single model's preferences dominate the final ranking (a minimal sketch of the judging step is shown below).
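To make the judging step concrete, here is a minimal, hypothetical Python sketch of collective pairwise judging, not the paper's actual implementation: every pair of responses is shown to every judge, the majority vote decides each comparison, and models are ranked by their pairwise win rate. The names council_rank and judge are illustrative placeholders.

    from collections import Counter, defaultdict
    from itertools import combinations

    def council_rank(responses, judges):
        # responses: dict mapping model name -> its answer to the dilemma
        # judges: callables taking (answer_a, answer_b) and returning "A" or "B";
        #         in the LMC these would be the council members acting as judges
        wins = defaultdict(int)
        games = defaultdict(int)
        for (model_a, ans_a), (model_b, ans_b) in combinations(responses.items(), 2):
            votes = Counter(judge(ans_a, ans_b) for judge in judges)  # collective jury vote
            winner = model_a if votes["A"] >= votes["B"] else model_b
            wins[winner] += 1
            games[model_a] += 1
            games[model_b] += 1
        # Rank models by pairwise win rate under the council's majority votes
        return sorted(responses, key=lambda m: wins[m] / games[m], reverse=True)

In the actual paper, the council also writes the interpersonal dilemmas and answers them before judging, and the authors analyze agreement and bias across judges; this sketch only illustrates how many individual verdicts can be combined into a single consensus ranking.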

Why it matters?

This research matters because it offers a framework for assessing AI models that reduces the bias of any single judge. By pooling judgments from many models, the LMC produces rankings that are more separable, more robust, and less biased than those from any individual LLM judge, and that track a human-established leaderboard more closely than other benchmarks. That makes it a promising way to evaluate, and ultimately improve, how AI systems handle subjective skills like emotional understanding and creative writing in real-world applications.

Abstract

The rapid advancement of Large Language Models (LLMs) necessitates robust and challenging benchmarks. Leaderboards like Chatbot Arena rank LLMs based on how well their responses align with human preferences. However, many tasks, such as those related to emotional intelligence, creative writing, or persuasiveness, are highly subjective and often lack majoritarian human agreement. Judges may have irreconcilable disagreements about what constitutes a better response. To address the challenge of ranking LLMs on highly subjective tasks, we propose a novel benchmarking framework, the Language Model Council (LMC). The LMC operates through a democratic process to: 1) formulate a test set through equal participation, 2) administer the test among council members, and 3) evaluate responses as a collective jury. We deploy a council of 20 of the newest LLMs on an open-ended emotional intelligence task: responding to interpersonal dilemmas. Our results show that the LMC produces rankings that are more separable, robust, and less biased than those from any individual LLM judge, and is more consistent with a human-established leaderboard than other benchmarks.