
Style over Substance: Failure Modes of LLM Judges in Alignment Benchmarking

Benjamin Feuer, Micah Goldblum, Teresa Datta, Sanjana Nambiar, Raz Besaleli, Samuel Dooley, Max Cembalest, John P. Dickerson

2024-09-24


Summary

This paper examines the limitations of using large language models (LLMs) as judges for evaluating AI alignment. It introduces SOS-Bench, a new benchmark that measures alignment with concrete metrics for safety, world knowledge, and instruction following.

What's the problem?

As LLMs like ChatGPT have become popular, many post-training methods have been developed to improve their alignment with human preferences. However, wins according to LLM judges do not always translate into real improvements in safety, world knowledge, or instruction following. LLM judges also tend to favor stylistic elements (like how friendly the text sounds) over factual accuracy and safety, which can make their evaluations misleading.
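
To make the setup concrete, here is a minimal sketch of how an LLM-as-a-judge pairwise comparison is typically run. The prompt wording, parsing logic, and example reply are illustrative assumptions, not the exact protocol used in the paper or any specific benchmark.

```python
# Minimal sketch of the LLM-as-a-judge pairwise setup the paper studies.
# The prompt wording and parsing below are illustrative assumptions.

JUDGE_PROMPT = (
    "You are a judge comparing two answers to the same question.\n"
    "Question: {question}\n"
    "Answer A: {answer_a}\n"
    "Answer B: {answer_b}\n"
    'Reply with exactly one letter, "A" or "B", for the better answer.'
)

def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Fill in the pairwise comparison prompt sent to the judge model."""
    return JUDGE_PROMPT.format(question=question, answer_a=answer_a, answer_b=answer_b)

def parse_verdict(reply: str) -> str:
    """Map the judge model's free-text reply to a winner, 'A' or 'B'."""
    return "A" if reply.strip().upper().startswith("A") else "B"

# A verbose, friendly reply may win the judge's vote even when a terse but
# more accurate answer exists -- the "style over substance" failure mode.
print(parse_verdict("A, because it is more detailed and politely worded."))
```
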

What's the solution?

To address these issues, the researchers created SOS-Bench, a large, standardized, and reproducible benchmark for evaluating LLM alignment. Using it, they found that LLM judges' preferences do not correlate well with concrete measures of safety, world knowledge, and instruction following. They also showed that the supervised fine-tuning (SFT) stage of post-training has a greater impact on alignment than preference optimization methods, with data scaling and prompt diversity in the SFT data being the driving factors.
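
As a rough illustration of what "do not correlate well" means operationally, the sketch below compares judge win rates with a concrete alignment score using a Spearman rank correlation. All numbers are made up for illustration, and the use of `scipy` is an assumption; none of this is data from SOS-Bench or the paper.

```python
# Toy sketch: do LLM-judge win rates track a concrete alignment metric?
# All numbers below are invented for illustration only.
from scipy.stats import spearmanr

models = ["model_a", "model_b", "model_c", "model_d", "model_e"]
judge_win_rate = [0.72, 0.65, 0.58, 0.41, 0.30]   # LLM-judge pairwise win rates
safety_score   = [0.44, 0.61, 0.39, 0.66, 0.52]   # score on a concrete safety benchmark

rho, p_value = spearmanr(judge_win_rate, safety_score)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
# A rho near zero, as in this toy data, would mean judge preferences say
# little about measured safety -- the pattern the paper reports.
```
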

Why it matters?

This research is important because it helps clarify how we should evaluate AI models to ensure they are safe and effective. By introducing SOS-Bench, the study provides a better way to assess whether LLMs truly align with human values, which is crucial for developing reliable AI systems that can be trusted in real-world applications.

Abstract

The release of ChatGPT in November 2022 sparked an explosion of interest in post-training and an avalanche of new preference optimization (PO) methods. These methods claim superior alignment by virtue of better correspondence with human pairwise preferences, often measured by LLM judges. In this work, we attempt to answer the following question -- do LLM-judge preferences translate to progress on other, more concrete metrics for alignment, and if not, why not? We define a concrete metric for alignment, and introduce SOS-Bench, the largest standardized, reproducible LLM meta-benchmark to date. We find that (1) LLM-judgments do not correlate with concrete measures of safety, world knowledge, and instruction following; (2) LLM judges have powerful implicit biases, prioritizing style over factuality and safety; and (3) the supervised fine-tuning (SFT) stage of post-training, and not the PO stage, has the greatest impact on alignment, with data scaling and prompt diversity as the driving factors. Our codebase and complete results can be found at https://github.com/penfever/sos-bench.