
LongDPO: Unlock Better Long-form Generation Abilities for LLMs via Critique-augmented Stepwise Information

Bowen Ping, Jiali Zeng, Fandong Meng, Shuo Wang, Jie Zhou, Shanghang Zhang

2025-02-04


Summary

This paper introduces LongDPO, a new method for improving how AI models generate long pieces of text, like academic papers or repository-level code. It focuses on making the text more accurate, higher in quality, and closer to the required length by using a step-by-step approach with detailed feedback.

What's the problem?

Current AI models, including advanced ones like GPT-4o, struggle with creating long texts that meet specific requirements. They often produce content that is either too short or too long and lacks the necessary quality because they don’t get detailed feedback during the writing process. This makes it hard for these models to handle complex tasks effectively.

What's the solution?

The researchers developed LongDPO, which improves long-text generation with process supervision, meaning the model gets feedback at each step of writing rather than only on the finished text. They employed Monte Carlo Tree Search to gather preferred and rejected candidates at each step of the writing process and used a global memory pool to keep the content consistent. They also added external critiques to refine low-quality candidates, and applied a training method called step-level DPO to teach the model how to generate better text at each stage. This approach improved both the length and quality of generated text while maintaining strong performance on general tasks.
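To make the training step concrete, here is a minimal PyTorch sketch of a step-level DPO loss: the standard DPO objective applied to a pair of preferred and dispreferred candidate steps that share the same prefix. The function name and the beta value are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a step-level DPO loss (assumed formulation, not the
# paper's exact code). Inputs are summed log-probabilities of a chosen and a
# rejected candidate *step*, under the trained policy and a frozen reference.
import torch
import torch.nn.functional as F

def step_dpo_loss(
    policy_chosen_logp: torch.Tensor,    # log-prob of the chosen step under the policy
    policy_rejected_logp: torch.Tensor,  # log-prob of the rejected step under the policy
    ref_chosen_logp: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logp: torch.Tensor,
    beta: float = 0.1,                   # illustrative value; the paper may use another
) -> torch.Tensor:
    # Implicit reward margin between the preferred and dispreferred step.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # Standard DPO: maximize the log-sigmoid of that margin.
    return -F.logsigmoid(margin).mean()
```

In practice, the log-probabilities would be computed by running the policy and reference models over the shared prefix plus each candidate step and summing token log-probs over the step tokens only.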

Why it matters?

This research is important because it helps AI models create longer and more complex texts that are accurate and high-quality, which is essential for tasks like academic writing or detailed code generation. By improving how these models handle long-form content, LongDPO makes AI more useful for real-world applications where precision and structure are critical.

Abstract

Long-form generation is crucial for writing academic papers and for repo-level code generation. Despite this, current models, including GPT-4o, still exhibit unsatisfactory performance. Existing methods that utilize preference learning with outcome supervision often fail to provide detailed feedback for extended contexts. This shortcoming can lead to content that does not fully satisfy query requirements, resulting in issues like length deviations and diminished quality. In this paper, we propose enhancing long-form generation by incorporating process supervision. We employ Monte Carlo Tree Search to gather stepwise preference pairs, utilizing a global memory pool to maintain consistency. To address the issue of suboptimal candidate selection, we integrate external critiques to refine and improve the quality of the preference pairs. Finally, we apply step-level DPO using the collected stepwise preference pairs. Experimental results show that our method improves length and quality on long-form generation benchmarks, with almost lossless performance on general benchmarks across various model backbones.
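As a rough illustration of the data-collection side described in the abstract, the sketch below shows one way stepwise preference pairs could be gathered with a global memory pool and external critiques. The helper callables (generate_candidates, score, critique_and_refine) are hypothetical placeholders, and the simple best-versus-worst candidate selection stands in for the paper's Monte Carlo Tree Search.

```python
# Rough sketch of stepwise preference-pair collection (assumptions, not the
# paper's pipeline): caller supplies the generation, scoring, and critique
# functions; simple best/worst sampling replaces Monte Carlo Tree Search.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class MemoryPool:
    """Global memory pool: short summaries carried across steps for consistency."""
    entries: List[str] = field(default_factory=list)

    def update(self, step_text: str) -> None:
        self.entries.append(step_text[:200])  # keep a truncated summary of each step

    def as_context(self) -> str:
        return "\n".join(self.entries)

def collect_step_preferences(
    prompt: str,
    num_steps: int,
    generate_candidates: Callable[[str], List[str]],  # hypothetical: propose next-step candidates
    score: Callable[[str, str], float],               # hypothetical: score a candidate given context
    critique_and_refine: Callable[[str, str], str],   # hypothetical: refine a candidate via external critique
    quality_threshold: float = 0.5,
) -> List[Tuple[str, str, str]]:
    """Return (context, chosen_step, rejected_step) triples for step-level DPO."""
    memory = MemoryPool()
    pairs: List[Tuple[str, str, str]] = []
    context = prompt
    for _ in range(num_steps):
        full_context = context + "\n" + memory.as_context()
        # Assumes generate_candidates returns at least two candidate steps.
        candidates = sorted(
            generate_candidates(full_context),
            key=lambda c: score(full_context, c),
            reverse=True,
        )
        chosen, rejected = candidates[0], candidates[-1]
        # If even the best candidate is weak, ask an external critic to refine it.
        if score(full_context, chosen) < quality_threshold:
            chosen = critique_and_refine(full_context, chosen)
        pairs.append((full_context, chosen, rejected))
        # Continue generation from the chosen step and update the memory pool.
        context = context + "\n" + chosen
        memory.update(chosen)
    return pairs
```

The collected triples would then feed a step-level DPO training loop like the loss sketch shown earlier.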