Large Language Models Can Self-Improve in Long-context Reasoning
Siheng Li, Cheng Yang, Zesen Cheng, Lemao Liu, Mo Yu, Yujiu Yang, Wai Lam
2024-11-14

Summary
This paper discusses how large language models (LLMs) can improve their ability to reason over long pieces of text by using a method that allows them to learn from their own outputs instead of relying on human-created data.
What's the problem?
LLMs have made progress in understanding long texts, but they still struggle to reason through complex information spread across large amounts of content. Existing methods for improving their reasoning typically rely on annotations from human experts or from advanced models such as GPT-4, which limits scalability and further improvement.
What's the solution?
The authors introduce a new approach called SeaLong, which enables LLMs to self-improve in long-context reasoning. The method is simple: sample multiple outputs for each question, score them based on how well they agree with one another (Minimum Bayes Risk), and then fine-tune the model on the best-scoring outputs. Experiments showed that this significantly improved long-context reasoning, with an absolute gain of 4.2 points for Llama-3.1-8B-Instruct, and that it outperformed previous methods relying on data from human experts or advanced models.
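To make the scoring step concrete, here is a minimal sketch of consensus-style Minimum Bayes Risk scoring in Python. The `similarity` function and the overall structure are illustrative assumptions, not the paper's actual implementation; any pairwise text-similarity utility (e.g., ROUGE or embedding cosine similarity) could be plugged in.

```python
def mbr_scores(candidates, similarity):
    """Score each sampled output by its average agreement with the others.

    candidates: list of model outputs sampled for one question.
    similarity: any pairwise utility function (e.g., ROUGE or embedding
    cosine similarity). Outputs that agree with the majority of samples
    receive higher scores.
    """
    scores = []
    for i, cand in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        scores.append(sum(similarity(cand, o) for o in others) / len(others))
    return scores

# The highest-scoring output can then be kept for supervised fine-tuning,
# or the best and worst outputs paired for preference optimization.
```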
Why it matters?
This research is important because it opens up new ways for AI models to enhance their reasoning skills independently, making them more efficient and capable without needing constant human oversight. This could lead to better performance in tasks that require understanding complex information over long texts, which is crucial as AI becomes more integrated into various applications.
Abstract
Large language models (LLMs) have achieved substantial progress in processing long contexts but still struggle with long-context reasoning. Existing approaches typically involve fine-tuning LLMs with synthetic data, which depends on annotations from human experts or advanced models like GPT-4, thus restricting further advancements. To address this issue, we investigate the potential for LLMs to self-improve in long-context reasoning and propose SeaLong, an approach specifically designed for this purpose. This approach is straightforward: we sample multiple outputs for each question, score them with Minimum Bayes Risk, and then apply supervised fine-tuning or preference optimization based on these outputs. Extensive experiments on several leading LLMs demonstrate the effectiveness of SeaLong, with an absolute improvement of 4.2 points for Llama-3.1-8B-Instruct. Furthermore, SeaLong achieves superior performance compared to prior approaches that depend on data produced by human experts or advanced models. We anticipate that this work will open new avenues for self-improvement techniques in long-context scenarios, which are essential for the continual advancement of LLMs.
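A common consensus-based formulation of the Minimum Bayes Risk score, given only as a sketch of the general idea (the utility u stands for a pairwise text-similarity metric and is an assumption, not necessarily the exact choice in the paper):

score(y_i) = (1 / (N - 1)) * sum over j != i of u(y_i, y_j),

where y_1, ..., y_N are the outputs sampled for a question; the output with the highest score is treated as the most reliable one for fine-tuning.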