BLEUBERI: BLEU is a surprisingly effective reward for instruction following
Yapei Chang, Yekyung Kim, Michael Krumdick, Amir Zadeh, Chuan Li, Chris Tanner, Mohit Iyyer
2025-05-22
Summary
This paper introduces BLEUBERI, an approach that uses BLEU, a simple metric for measuring how similar two pieces of text are, as a reward signal to train language models to follow instructions better.
What's the problem?
Training language models to follow instructions well usually relies on learned reward models, which are expensive to build and tune, making the whole alignment pipeline more complicated and time-consuming.
What's the solution?
The researchers showed that using BLEU, computed against reference responses, as the training reward produces language models that follow instructions just as well as models trained with much more complex learned reward models.
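To make the idea concrete, here is a minimal, self-contained sketch of what a BLEU-style reward function looks like: the geometric mean of clipped n-gram precisions between the model's output and a reference, multiplied by a brevity penalty. This is an illustrative pure-Python version, not the paper's exact implementation (in practice one would use a standard library such as sacreBLEU, and the paper's tokenization and smoothing choices may differ).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_reward(candidate, reference, max_n=4):
    """Sentence-level BLEU in [0, 1]: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty.
    Unsmoothed, so any zero precision yields a reward of 0."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0  # candidate too short to have any n-grams
        # Clip each n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0  # zero precision collapses the geometric mean
        log_prec_sum += math.log(clipped / total)
    # Brevity penalty discourages outputs shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_prec_sum / max_n)
```

During RL-style training, this scalar would simply replace the learned reward model's score for each sampled response: an exact match earns 1.0, partial overlap earns a value between 0 and 1, and unrelated text earns 0.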
Why does it matter?
This matters because it makes training helpful and accurate AI models much easier and faster, which is great for creating better virtual assistants, chatbots, and other tools that need to understand and follow what people ask.
Abstract
BLEUBERI uses BLEU as a reward function to optimize instruction following in language models, achieving quality comparable to models aligned with full reward models.