BLEUBERI: BLEU is a surprisingly effective reward for instruction following
Yapei Chang, Yekyung Kim, Michael Krumdick, Amir Zadeh, Chuan Li, Chris Tanner, Mohit Iyyer
2025-05-22
Summary
This paper introduces BLEUBERI, an approach that uses BLEU, a simple metric for measuring how similar two pieces of text are, as a reward signal to train language models to follow instructions better.
What's the problem?
Training language models to follow instructions well usually relies on learned reward models, which are expensive to build and tune, making the whole alignment pipeline more complicated and time-consuming.
What's the solution?
The researchers showed that using BLEU, computed against reference responses, as the training reward produces language models that follow instructions just as well as models trained with much more complex learned reward models.
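To make the idea concrete, here is a minimal, self-contained sketch of what a BLEU-style reward function looks like: the geometric mean of clipped n-gram precisions between the model's output and a reference, multiplied by a brevity penalty. This is an illustrative pure-Python version, not the paper's exact implementation (in practice one would use a standard library such as sacreBLEU, and the paper's tokenization and smoothing choices may differ).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_reward(candidate, reference, max_n=4):
    """Sentence-level BLEU in [0, 1]: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty.
    Unsmoothed, so any zero precision yields a reward of 0."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0  # candidate too short to have any n-grams
        # Clip each n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0  # zero precision collapses the geometric mean
        log_prec_sum += math.log(clipped / total)
    # Brevity penalty discourages outputs shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_prec_sum / max_n)
```

During RL-style training, this scalar would simply replace the learned reward model's score for each sampled response: an exact match earns 1.0, partial overlap earns a value between 0 and 1, and unrelated text earns 0.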
Why does it matter?
This matters because it makes training helpful and accurate AI models much easier and faster, which is great for creating better virtual assistants, chatbots, and other tools that need to understand and follow what people ask.
Abstract
BLEUBERI uses BLEU as a reward function to optimize instruction following in language models, achieving quality comparable to models aligned with full reward models.