LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization
Guanzheng Chen, Xin Li, Michael Qizhe Shieh, Lidong Bing
2025-02-20
Summary
This paper introduces LongPO, a new method that helps AI models perform better when working with long pieces of text while still keeping their ability to handle shorter texts. It's like teaching a student to read and understand long books without forgetting how to handle shorter passages.
What's the problem?
AI models that are good at understanding short texts often struggle when dealing with longer texts because they aren't properly trained for extended contexts. Fixing this usually requires a lot of human annotations for long texts, which is expensive and time-consuming. Additionally, improving their long-text abilities can sometimes make them worse at handling short texts.
What's the solution?
The researchers created LongPO, which lets an AI model learn from its own outputs by comparing the responses it gives to a long text versus a compressed, short version of that same text. Treating these paired responses as preference data teaches the model to transfer its short-text skills to longer inputs, while a short-to-long KL constraint keeps its short-text performance from declining. Applied to Mistral-7B-Instruct-v0.2 at context lengths up to 512K tokens, the approach handled long texts as well as top-tier models like GPT-4, all without needing extensive human annotation.
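To make the mechanism concrete, here is a minimal, self-contained sketch of a DPO-style preference loss with an added short-context regularization term, in the spirit of LongPO's short-to-long KL constraint. This is an illustration only: the function name, the hyperparameters `beta` and `lam`, and the scalar log-probability inputs are assumptions for the sketch, not the paper's actual formulation or values.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def longpo_style_loss(pi_chosen: float, pi_rejected: float,
                      ref_chosen: float, ref_rejected: float,
                      pi_short: float, ref_short: float,
                      beta: float = 0.1, lam: float = 0.1) -> float:
    """Toy preference loss (illustrative, not the paper's exact objective).

    pi_* / ref_* are summed log-probabilities of a response under the
    current policy and a frozen short-context reference model.
    - chosen:   response generated from the compressed short-context input
    - rejected: response generated directly from the long-context input
    - *_short:  a response on a purely short-context task, used to keep
                the policy close to the reference there
    """
    # DPO-style preference term: push the policy to favor the
    # short-context-derived response over the long-context one.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    pref_loss = -math.log(sigmoid(margin))
    # Short-to-long penalty: discourage drifting away from the reference
    # model's behavior on short-context inputs.
    kl_penalty = lam * (pi_short - ref_short)
    return pref_loss + kl_penalty
```

In this toy form, the loss shrinks as the policy assigns relatively more probability to the preferred (short-context-derived) response, while the second term discourages any gap from the reference model on short-context data.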
Why it matters?
This matters because it makes AI models more versatile and efficient, allowing them to handle both short and long texts effectively. By reducing the need for expensive human-annotated data, LongPO could help create smarter AI systems that work well in real-world situations, such as analyzing lengthy reports or maintaining context in long conversations.
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities through pretraining and alignment. However, superior short-context LLMs may underperform in long-context scenarios due to insufficient long-context alignment. This alignment process remains challenging due to the impracticality of human annotation for extended contexts and the difficulty in balancing short- and long-context performance. To address these challenges, we introduce LongPO, which enables short-context LLMs to self-evolve to excel on long-context tasks by internally transferring short-context capabilities. LongPO harnesses LLMs to learn from self-generated short-to-long preference data, comprising paired responses generated for identical instructions with long-context inputs and their compressed short-context counterparts, respectively. This preference reveals capabilities and potentials of LLMs cultivated during short-context alignment that may be diminished in under-aligned long-context scenarios. Additionally, LongPO incorporates a short-to-long KL constraint to mitigate short-context performance decline during long-context alignment. When applied to Mistral-7B-Instruct-v0.2 from 128K to 512K context lengths, LongPO fully retains short-context performance and largely outperforms naive SFT and DPO in both long- and short-context tasks. Specifically, LongPO-trained models can achieve results on long-context benchmarks comparable to, or even surpassing, those of superior LLMs (e.g., GPT-4-128K) that involve extensive long-context annotation and larger parameter scales.