Beyond Human Judgment: A Bayesian Evaluation of LLMs' Moral Values Understanding

Maciej Skorski, Alina Landowska

2025-08-20

Summary

This research investigates how well large language models such as Claude, DeepSeek, and Llama understand moral issues compared to people, drawing on over 250,000 human judgments of more than 100,000 texts from various online sources.

What's the problem?

We don't really know whether AI can grasp the nuances of human morality as well as humans can, or whether it understands things differently, especially in cases where people themselves disagree about what is right or wrong.

What's the solution?

The researchers used a Bayesian statistical framework that explicitly accounts for the fact that humans often disagree on moral judgments. They tested several leading AI models on a massive dataset of texts and human annotations, comparing each model's moral assessments with the humans'.
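The core idea of treating disagreement probabilistically can be sketched with a simple Beta-Binomial posterior over each text's "true" rate of being flagged. This is a minimal illustration of the general technique, not the paper's actual model; the function names and prior values are assumptions for this sketch.

```python
def beta_posterior_mean(positives, total, alpha=1.0, beta=1.0):
    """Posterior mean of the probability that annotators flag a text,
    under a Beta(alpha, beta) prior and a Binomial likelihood."""
    return (alpha + positives) / (alpha + beta + total)

def beta_posterior_variance(positives, total, alpha=1.0, beta=1.0):
    """Posterior variance: high when annotators are few or split,
    capturing residual uncertainty about the 'true' label."""
    a = alpha + positives
    b = beta + (total - positives)
    return a * b / ((a + b) ** 2 * (a + b + 1))

# A text flagged by 4 of 7 annotators sits near 0.5 with high variance:
# humans genuinely disagree (aleatoric uncertainty), so a model that
# differs from the bare majority vote is penalized only weakly.
split_mean = beta_posterior_mean(4, 7)       # ~0.56
split_var = beta_posterior_variance(4, 7)

# A text flagged by 7 of 7 annotators is a confident positive.
clear_mean = beta_posterior_mean(7, 7)       # ~0.89
```

Under a deterministic majority rule, both texts would count as equally "positive"; the posterior instead distinguishes confident labels from contested ones.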

Why it matters?

This study matters because it shows that AI models are surprisingly good at understanding moral dimensions: they often outperform the average human annotator and are more likely to flag morally questionable content. That sensitivity is increasingly important as AI becomes more integrated into our lives.

Abstract

How do large language models understand moral dimensions compared to humans? This first large-scale Bayesian evaluation of market-leading language models provides the answer. In contrast to prior work using deterministic ground truth (majority or inclusion rules), we model annotator disagreements to capture both aleatoric uncertainty (inherent human disagreement) and epistemic uncertainty (model domain sensitivity). We evaluate top language models (Claude Sonnet 4, DeepSeek-V3, Llama 4 Maverick) across 250K+ annotations from ~700 annotators on 100K+ texts spanning social media, news, and forums. Our GPU-optimized Bayesian framework processed 1M+ model queries, revealing that AI models typically rank among the top 25% of human annotators, achieving much better-than-average balanced accuracy. Importantly, we find that AI produces far fewer false negatives than humans, highlighting their more sensitive moral detection capabilities.