
Judging with Confidence: Calibrating Autoraters to Preference Distributions

Zhuohang Li, Xiaowei Li, Chengyu Huang, Guowang Li, Katayoon Goshvadi, Bo Dai, Dale Schuurmans, Paul Zhou, Hamid Palangi, Yiwen Song, Palash Goyal, Murat Kantarcioglu, Bradley A. Malin, Yuan Xue

2025-10-07


Summary

This paper investigates how to make the AI systems that judge other AI systems, called 'autoraters', more reliable at reflecting what humans actually prefer. Autoraters are becoming increasingly important for improving large language models.

What's the problem?

Currently, autoraters are trained by being told what the 'best' answer is, essentially forcing a single correct solution onto things that are often a matter of opinion or have many good answers. This is a problem because human preferences aren't always clear-cut; people have different tastes and opinions. If an autorater only learns one 'right' answer, it can't accurately reflect the variety of human preferences.

What's the solution?

The researchers developed a way to train autoraters to understand the *range* of human preferences, not just a single 'correct' answer. They created a framework to 'calibrate' these autoraters to match how a group of people actually feel about different options. They used two different methods: one for when they have detailed preference data, and another for when they only have simple 'yes' or 'no' feedback. Essentially, they taught the autoraters to predict how likely a person would be to prefer one answer over another.
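The core idea of training against a preference distribution rather than a single label can be sketched with a soft-label loss: instead of pushing the autorater's predicted preference probability toward 0 or 1, push it toward the fraction of people who actually preferred each answer. The function below is a minimal illustration of that idea, not the paper's actual implementation; the names and toy numbers are assumptions.

```python
import math

def soft_preference_loss(p_pred, p_human):
    """Cross-entropy between the autorater's predicted probability that
    answer A is preferred over answer B, and the empirical fraction of
    humans who preferred A. It is minimized when p_pred == p_human,
    i.e. when the autorater matches the population's preference split
    rather than committing to a single 'correct' answer."""
    eps = 1e-12  # clamp to avoid log(0)
    p_pred = min(max(p_pred, eps), 1 - eps)
    return -(p_human * math.log(p_pred) + (1 - p_human) * math.log(1 - p_pred))

# Suppose 60% of raters prefer answer A. An overconfident hard-label
# prediction is penalized more than one matching the 60/40 split.
print(soft_preference_loss(0.99, 0.6))  # overconfident: higher loss
print(soft_preference_loss(0.60, 0.6))  # matches the split: lower loss
```

In practice the same objective would be applied to the model's verbalized probabilities over many answer pairs; the point of the sketch is only that the minimum sits at the population's true preference rate, not at 0 or 1.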

Why it matters?

This work is important because more reliable autoraters mean we can build AI systems that are better aligned with human values. If we can accurately model what people want, we can create AI that is more helpful, safe, and trustworthy. It also reduces bias in the AI judging process, making it fairer and more representative of diverse opinions.

Abstract

The alignment of large language models (LLMs) with human values increasingly relies on using other LLMs as automated judges, or "autoraters". However, their reliability is limited by a foundational issue: they are trained on discrete preference labels, forcing a single ground truth onto tasks that are often subjective, ambiguous, or nuanced. We argue that a reliable autorater must learn to model the full distribution of preferences defined by a target population. In this paper, we propose a general framework for calibrating probabilistic autoraters to any given preference distribution. We formalize the problem and present two learning methods tailored to different data conditions: 1) a direct supervised fine-tuning approach for dense, probabilistic labels, and 2) a reinforcement learning approach for sparse, binary labels. Our empirical results show that fine-tuning autoraters with a distribution-matching objective leads to verbalized probability predictions that are better aligned with the target preference distribution, with improved calibration and significantly lower positional bias, all while preserving performance on objective tasks.
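"Improved calibration" here means the autorater's verbalized probabilities track how often humans actually prefer the corresponding answer. A standard way to quantify this is expected calibration error (ECE): bucket predictions by confidence and compare each bucket's average prediction to its empirical preference rate. The binning sketch below is an illustration of that metric under these assumptions, not the paper's evaluation code.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """ECE over binary preference outcomes: bucket predicted
    probabilities into equal-width confidence bins, then take the
    bin-size-weighted average of |mean prediction - empirical rate|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # keep p == 1.0 in last bin
        bins[idx].append((p, y))
    ece, n = 0.0, len(probs)
    for b in bins:
        if not b:
            continue
        avg_p = sum(p for p, _ in b) / len(b)
        rate = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(avg_p - rate)
    return ece

# Toy data where predictions match outcome frequencies exactly:
# the 0.2-confidence bucket wins 1 of 5, the 0.8 bucket wins 4 of 5.
probs = [0.2] * 5 + [0.8] * 5
outcomes = [1, 0, 0, 0, 0] + [1, 1, 1, 1, 0]
print(expected_calibration_error(probs, outcomes))  # near zero: well calibrated
```

A lower ECE means the verbalized probabilities can be read at face value, which is exactly the property a distribution-matched autorater is trained to have.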