Annotation-Efficient Universal Honesty Alignment

Shiyu Ni, Keping Bi, Jiafeng Guo, Minghao Tang, Jingtong Wu, Zengxin Han, Xueqi Cheng

2025-10-21

Summary

This paper aims to make large language models (LLMs) more trustworthy by teaching them to recognize when they *don't* know something and to express confidence that matches how likely their answers are to be correct. The authors call this property 'honesty'.

What's the problem?

Existing approaches either estimate a model's confidence without any extra training (for example, from token probabilities or self-consistency) or train the model on many examples labeled with whether its answers are right or wrong. Collecting those correctness labels at scale is expensive and time-consuming, which makes training-based honesty alignment hard to apply universally.

What's the solution?

The researchers propose a method called EliCal (Elicitation-Then-Calibration). It works in two stages. First, it trains the model to express its internal confidence using an inexpensive signal called 'self-consistency': the model answers the same question multiple times, and the degree to which the answers agree gives a rough estimate of how confident it is. Second, it uses a *small* number of 'right or wrong' examples to calibrate that initial confidence and make it more accurate. They also released HonestyBench, a benchmark covering ten free-form QA datasets with 560k training and 70k evaluation instances, to support training and evaluation at scale.
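
To make the self-consistency idea concrete, here is a minimal sketch of how an agreement-based confidence score could be computed. It is an illustration under stated assumptions, not the authors' implementation: the `sample_answer` callable, the `normalize` helper, and the use of exact string match after normalization to decide whether two answers "agree" are all hypothetical choices made for this example.

```python
from collections import Counter
from typing import Callable, Tuple


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match.

    Assumption: exact match after normalization stands in for answer agreement.
    """
    return " ".join(answer.lower().split())


def self_consistency_confidence(
    sample_answer: Callable[[str], str],  # hypothetical: returns one sampled model answer
    question: str,
    num_samples: int = 10,
) -> Tuple[str, float]:
    """Sample several answers and return (most common answer, agreement rate)."""
    answers = [normalize(sample_answer(question)) for _ in range(num_samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / num_samples


# Usage with a stub sampler standing in for a real model.
canned = iter(["Paris", "paris ", "Lyon", "Paris"])
result = self_consistency_confidence(lambda q: next(canned), "Capital of France?", num_samples=4)
print(result)  # ('paris', 0.75)
```

A score like this needs no correctness labels, which is why it is cheap to collect. It measures only how consistent the model is, not whether it is right, so the second stage still needs a small set of correctness annotations.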

Why it matters?

This work offers a way to make LLMs more reliable without needing a huge amount of labeled data. By using self-consistency as a starting point, EliCal reaches near-optimal honesty alignment with only 1k correctness annotations (0.18% of full supervision) and generalizes better to unseen MMLU tasks than a calibration-only baseline, making it more practical to build trustworthy AI systems.

Abstract

Honesty alignment, the ability of large language models (LLMs) to recognize their knowledge boundaries and express calibrated confidence, is essential for trustworthy deployment. Existing methods either rely on training-free confidence estimation (e.g., token probabilities, self-consistency) or training-based calibration with correctness annotations. While effective, achieving universal honesty alignment with training-based calibration requires costly, large-scale labeling. To support annotation-efficient training, we introduce Elicitation-Then-Calibration (EliCal), a two-stage framework that first elicits internal confidence using inexpensive self-consistency supervision, then calibrates this confidence with a small set of correctness annotations. To support a large-scale study, we release HonestyBench, a benchmark covering ten free-form QA datasets with 560k training and 70k evaluation instances annotated with correctness and self-consistency signals. Experiments show that EliCal achieves near-optimal alignment with only 1k correctness annotations (0.18% of full supervision) and better alignment performance on unseen MMLU tasks than the calibration-only baseline, offering a scalable solution toward universal honesty alignment in LLMs.
