One Token to Fool LLM-as-a-Judge

Yulai Zhao, Haolin Liu, Dian Yu, S. Y. Kung, Haitao Mi, Dong Yu

2025-07-14

Summary

This paper talks about how generative reward models, which help AI decide if its answers are good, can be tricked by shallow tricks, causing the AI to get false rewards.

What's the problem?

Generative reward models sometimes give high scores to AI responses that only sound good on the surface but aren't actually right, leading to mistakes and overconfidence.

What's the solution?

The researchers introduced a new way to add more diverse training examples that help the reward models learn to recognize and avoid these shallow tricks, making them more robust and less likely to be fooled.

Why it matters?

This matters because it improves how AI judges its own work, making AI systems more reliable and trustworthy by reducing errors caused by false positive rewards.

Abstract

Generative reward models using large language models are vulnerable to superficial manipulations, leading to false positive rewards, and a new data augmentation strategy improves their robustness.

View Paper