SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning

Tianjian Li, Daniel Khashabi

2025-05-09

Summary

This paper introduces SIMPLEMIX, a straightforward way to improve how language models learn what people like by mixing two types of training data: on-policy data, generated by the model's own outputs, and off-policy data, collected from outside sources.

What's the problem?

The problem is that teaching language models to understand and match human preferences can be tricky, especially when deciding how to use different kinds of training data. Many existing methods for combining these data types are complicated and don't always work well.

What's the solution?

The researchers showed that simply mixing on-policy data (from the model itself) with off-policy data (from other sources) actually works better than more complicated methods. This simple approach helps the model learn to align with what people want across a variety of tasks.
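The core idea can be sketched in a few lines. The helper below is a hypothetical illustration (the function name, the mixing ratio, and the data format are assumptions, not the paper's exact procedure): it combines a fraction of on-policy preference pairs with off-policy pairs into one training set.

```python
import random

def simple_mix(on_policy, off_policy, ratio=0.5, seed=0):
    """Combine on- and off-policy preference pairs into one dataset.

    `ratio` is the fraction of the mixed dataset drawn from the
    on-policy pool; the remainder comes from the off-policy pool.
    (Hypothetical sketch -- the paper's exact procedure may differ.)
    """
    rng = random.Random(seed)
    n = min(len(on_policy), len(off_policy)) * 2  # total pairs to keep
    n_on = int(n * ratio)
    n_off = n - n_on
    mixed = rng.sample(on_policy, n_on) + rng.sample(off_policy, n_off)
    rng.shuffle(mixed)  # interleave the two sources
    return mixed

# Toy preference pairs in (prompt, chosen_response, rejected_response) form
on_data = [(f"on_{i}", "good", "bad") for i in range(10)]
off_data = [(f"off_{i}", "good", "bad") for i in range(10)]
mixed = simple_mix(on_data, off_data, ratio=0.5)
print(len(mixed))  # 20
```

An even split (ratio=0.5) is just one choice; the point of the paper is that such a plain mixture is already competitive with more elaborate data-integration schemes.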

Why it matters?

This matters because it makes training language models to understand and follow human preferences easier and more effective. As a result, AI assistants and chatbots can become more helpful, trustworthy, and better at giving people what they actually want.

Abstract

A combination of on-policy and off-policy data enhances language model alignment across various tasks, outperforming complex integration methods.