Robust Reward Modeling via Causal Rubrics

Pragya Srivastava, Harman Singh, Rahul Madhavan, Gandharv Patil, Sravanti Addepalli, Arun Suggala, Rengarajan Aravamudhan, Soumya Sharma, Anirban Laha, Aravindan Raghuveer, Karthikeyan Shanmugam, Doina Precup

2025-06-24

Summary

This paper introduces Crome, a new framework that makes reward models for AI more reliable and accurate by combining causal reasoning with targeted data augmentations.

What's the problem?

The problem is that current reward models can be fooled by superficial features in their training data, such as a response's length or style, rather than its actual quality. When that happens, the model hands out rewards for the wrong reasons, and an AI trained on those rewards learns to game them (a failure known as reward hacking) instead of genuinely improving.
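To make the failure concrete, here is a toy sketch (not code from the paper; the `spurious_reward` function and the length cue are purely illustrative) of how a reward model that has latched onto a superficial feature can be gamed:

```python
# Toy illustration: a reward model that keys on response length,
# a stand-in for any superficial cue a model might learn.

def spurious_reward(response: str) -> float:
    """Hypothetical learned reward that grows with word count."""
    return 0.1 * len(response.split())

honest = "Paris is the capital of France."
padded = honest + " To elaborate, " * 20 + "that is the answer."

# The padded answer adds no factual content, yet scores higher,
# so a policy optimized against this reward learns to pad.
print(spurious_reward(honest))  # low score
print(spurious_reward(padded))  # much higher score
```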

What's the solution?

The researchers introduced Crome, which trains the reward model on two kinds of synthetic preference data: causal augmentations, which change an answer along attributes that genuinely matter (such as factual accuracy) so the model learns to reward those, and neutral augmentations, which change only irrelevant surface details (such as style) so the model learns to ignore them. Together, these teach the reward model to focus on true quality and resist being fooled by spurious details, as sketched below.
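Here is a minimal sketch of how such augmented preference pairs could be assembled. All names are hypothetical: in the actual method an LLM would generate the rewrites, and the exact labeling scheme may differ; encoding style invariance as an equal-preference (tie) label is one plausible choice, not necessarily the paper's.

```python
# Hypothetical helpers standing in for LLM-generated rewrites.

def corrupt_fact(answer: str) -> str:
    """Break a causal attribute (factual accuracy), leave style intact."""
    return answer.replace("Paris", "Lyon")  # toy factual corruption

def restyle(answer: str) -> str:
    """Change only surface style, leave the facts intact."""
    return "In short: " + answer.lower()

def make_augmented_pairs(prompt: str, chosen: str) -> list[dict]:
    pairs = []
    # Causal augmentation: the factually corrupted answer is ranked
    # lower, teaching the model that the causal attribute matters.
    pairs.append({"prompt": prompt, "chosen": chosen,
                  "rejected": corrupt_fact(chosen), "type": "causal"})
    # Neutral augmentation: the restyled answer gets an equal-preference
    # label, teaching invariance to spurious stylistic differences.
    pairs.append({"prompt": prompt, "a": chosen, "b": restyle(chosen),
                  "label": "tie", "type": "neutral"})
    return pairs

for pair in make_augmented_pairs("What is the capital of France?",
                                 "Paris is the capital of France."):
    print(pair)
```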

Why it matters?

This matters because better reward models make it possible to train AI systems that behave more safely, fairly, and effectively: a reward model that understands what really matters makes fewer errors caused by misleading cues.

Abstract

Crome, a novel reward modeling framework built on causal and neutral augmentations, significantly improves the accuracy of reward models and their robustness against reward hacking.