Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas
Yu Ying Chiu, Zhilin Wang, Sharan Maiya, Yejin Choi, Kyle Fish, Sydney Levine, Evan Hubinger
2025-05-21
Summary
This paper tests AI models to reveal which values they prioritize, especially in difficult situations where they must choose between competing values, such as telling the truth or protecting someone, like lying to save sick children.
What's the problem?
It's hard to predict what an AI will do in morally tricky situations, and an AI whose choices can't be anticipated may make decisions that are risky or go against what people expect.
What's the solution?
The researchers identified which values AI models actually act on using LitmusValues, then evaluated those values against AIRiskDilemmas and HarmBench, helping to spot both obvious and hidden risky behaviors.
Why does it matter?
Understanding and predicting how an AI will act in real-life dilemmas makes it safer and more trustworthy, especially when it is used in situations that affect people's lives.
Abstract
Identifying the values that AI models prioritize using LitmusValues, and evaluating those values through AIRiskDilemmas and HarmBench, can predict both known and unknown risky behaviors.