
DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization

Shuaijie She, Yu Bao, Yu Lu, Lu Xu, Tao Li, Wenhao Zhu, Shujian Huang, Shanbo Cheng, Lu Lu, Yuxuan Wang

2025-08-21


Summary

This paper introduces DuPO, a new method for training AI models, especially large language models (LLMs), without requiring human-labeled examples. It works by pairing each task with a related 'dual' task that checks the original task's output, like verifying a puzzle solution by working backward from it.

What's the problem?

Existing methods for improving AI models often need lots of human-labeled data, which is expensive and time-consuming, or they only work for specific types of problems where you can easily reverse the process, like translating a sentence and then translating it back. This limits how widely these improvement methods can be used.

What's the solution?

DuPO tackles these problems by creating a 'dual' task for any given task. It splits the original task's input into parts we know and parts we don't. The dual task's goal is then to reconstruct the unknown parts using the output of the original task together with the known information; for example, using a math solution to recover a hidden quantity from the original problem. How well this reconstruction succeeds acts as an automatic check on the original task, without needing any external labels. Because the dual task only has to recover the unknown part rather than invert the whole process, this works even for tasks that aren't easily reversible.
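
To make the idea concrete, here is a minimal sketch of how such a dual-reconstruction reward could be computed. This is an illustration, not the authors' implementation: the `llm` callable, the prompt wording, and the token-overlap `similarity` helper are all placeholder assumptions.

```python
# Minimal sketch of DuPO's self-supervised reward idea (illustrative, not the authors' code).
# Assumptions: `llm` is any text-in/text-out model call; the prompt wording and the
# token-overlap similarity are placeholder choices, not what the paper actually uses.
from typing import Callable


def similarity(a: str, b: str) -> float:
    """Crude token-overlap score standing in for a proper reconstruction metric."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)


def dupo_reward(llm: Callable[[str], str],
                known: str, unknown: str,
                primal_prompt: str, dual_prompt: str) -> float:
    """Score a primal output by how well the dual task recovers the hidden input part."""
    # Primal task: produce an answer from the full input (known + unknown parts).
    primal_output = llm(f"{primal_prompt}\n{known}\n{unknown}")
    # Dual task: given only the known part and the primal output, reconstruct the unknown part.
    reconstruction = llm(f"{dual_prompt}\n{known}\n{primal_output}")
    # Reconstruction quality acts as an annotation-free reward for the primal output.
    return similarity(reconstruction, unknown)
```

In DuPO's framework, rewards like this computed over several sampled outputs would then drive preference optimization of the model; the sketch only shows the reward itself.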

Why it matters?

This research matters because it offers a way to improve AI models more efficiently and more broadly. By removing the need for human labels and extending to tasks that can't simply be run in reverse, DuPO makes it easier and cheaper to build better AI systems. It shows clear gains in language translation and mathematical reasoning, suggesting it could make AI more accurate and reliable across many different applications.

Abstract

We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback via a generalized duality. DuPO addresses two key limitations: Reinforcement Learning with Verifiable Rewards (RLVR)'s reliance on costly labels and applicability restricted to verifiable tasks, and traditional dual learning's restriction to strictly dual task pairs (e.g., translation and back-translation). Specifically, DuPO decomposes a primal task's input into known and unknown components, then constructs its dual task to reconstruct the unknown part using the primal output and known information (e.g., reversing math solutions to recover hidden variables), broadening applicability to non-invertible tasks. The quality of this reconstruction serves as a self-supervised reward to optimize the primal task, synergizing with LLMs' ability to instantiate both tasks via a single model. Empirically, DuPO achieves substantial gains across diverse tasks: it enhances the average translation quality by 2.13 COMET over 756 directions, boosts the mathematical reasoning accuracy by an average of 6.4 points on three challenge benchmarks, and enhances performance by 9.3 points as an inference-time reranker (trading computation for accuracy). These results position DuPO as a scalable, general, and annotation-free paradigm for LLM optimization.
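
The "inference-time reranker" result suggests the same dual reward can also be used without any extra training: sample several candidate outputs and keep the one whose dual reconstruction matches the hidden input part best. The sketch below shows that best-of-N pattern as an assumption; the paper's actual reranking procedure and scoring metric may differ.

```python
# Hypothetical best-of-N reranking with a dual-reconstruction score (illustrative only;
# the paper's actual reranking procedure and scoring metric may differ).
from typing import Callable, List


def similarity(a: str, b: str) -> float:
    """Same crude token-overlap stand-in as in the earlier sketch."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)


def rerank_by_dual(llm: Callable[[str], str],
                   candidates: List[str],
                   known: str, unknown: str,
                   dual_prompt: str) -> str:
    """Keep the candidate whose dual reconstruction best matches the hidden input part."""
    def dual_score(candidate: str) -> float:
        reconstruction = llm(f"{dual_prompt}\n{known}\n{candidate}")
        return similarity(reconstruction, unknown)
    # Trading computation for accuracy: score every sampled candidate, keep the best one.
    return max(candidates, key=dual_score)
```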