ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities
Peng Xu, Wei Ping, Xianchao Wu, Zihan Liu, Mohammad Shoeybi, Bryan Catanzaro
2024-07-22

Summary
This paper introduces ChatQA 2, a Llama3-based model that aims to improve how large language models (LLMs) handle long inputs and retrieve supporting information from external sources. The goal is to make open-access models competitive with leading proprietary models such as GPT-4-Turbo.
What's the problem?
Many existing LLMs struggle with long-context understanding: their context windows are too short to take in lengthy documents or many retrieved passages at once, which makes it hard to answer complex queries over large bodies of text accurately. Open models also need stronger retrieval-augmented generation (RAG) capabilities, which let a model pull relevant information from external databases to support its answers.
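To make the RAG idea concrete, here is a minimal sketch of a retrieve-then-generate loop. The bag-of-words "embedding" and the final prompt template are toy stand-ins invented for illustration, not the paper's actual retriever or model.

```python
# Minimal sketch of a retrieve-then-generate (RAG) loop.
# The bag-of-words "embedding" is a toy stand-in for a real
# dense retriever, not the system used in the paper.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Score every chunk against the query and keep the top-k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

# External passages the model did not see in its original prompt.
chunks = [
    "ChatQA 2 extends the Llama3 context window to 128K tokens.",
    "Retrieval-augmented generation fetches supporting passages.",
    "The weather in Santa Clara is usually sunny.",
]
query = "How long is the ChatQA 2 context window?"
context = "\n".join(retrieve(query, chunks))
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
print(prompt)  # this assembled prompt would be sent to the LLM
```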
What's the solution?
The authors built ChatQA 2 by extending the context window of the Llama3 model from 8K tokens to 128K tokens through continued training, allowing the model to attend to and process far more information at once. They then applied a three-stage instruction-tuning process to improve the model's instruction following, RAG performance, and long-context understanding. Their experiments show that ChatQA 2 achieves accuracy comparable to GPT-4-Turbo on many long-context understanding tasks and surpasses it on a RAG benchmark.
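The paper's full continued-training recipe is in the text below; as background for why continued training is needed at all, the sketch here looks at Llama-style rotary position embeddings (RoPE). Frequency pairs whose rotation period exceeds the training length produce positional angles the model has never seen, so simply feeding 128K tokens to a model trained at 8K puts it out of distribution. The head dimension and base value are illustrative assumptions, not the paper's settings.

```python
# Sketch: why extending context from 8K to 128K calls for continued
# training of rotary position embeddings (RoPE). Frequency pairs
# whose period exceeds the training length yield angles at longer
# positions that the model never saw during training.
# dim and base are illustrative, not the paper's settings.
import numpy as np

dim, base = 128, 5e5                      # head dim / RoPE base (assumed)
inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
period = 2 * np.pi / inv_freq             # tokens per full rotation, per pair

for length in (8_000, 128_000):
    unseen = int((period > length).sum())
    print(f"at {length:>7} tokens: {unseen} of {len(period)} frequency "
          f"pairs have not completed a full rotation")
```

Pairs that never wrap within the training length are exactly the ones that must be adapted by training on longer sequences, which is what a continued-training recipe like the paper's addresses.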
Why it matters?
This research is important because it enhances the capabilities of open-access LLMs, making them more competitive with commercial models. By improving long-context understanding and retrieval abilities, ChatQA 2 can help users get better answers in applications like question-answering systems, making AI tools more effective and accessible for everyone.
Abstract
In this work, we introduce ChatQA 2, a Llama3-based model designed to bridge the gap between open-access LLMs and leading proprietary models (e.g., GPT-4-Turbo) in long-context understanding and retrieval-augmented generation (RAG) capabilities. These two capabilities are essential for LLMs to process large volumes of information that cannot fit into a single prompt and are complementary to each other, depending on the downstream tasks and computational budgets. We present a detailed continued training recipe to extend the context window of Llama3-70B-base from 8K to 128K tokens, along with a three-stage instruction tuning process to enhance the model's instruction-following, RAG performance, and long-context understanding capabilities. Our results demonstrate that the Llama3-ChatQA-2-70B model achieves accuracy comparable to GPT-4-Turbo-2024-04-09 on many long-context understanding tasks and surpasses it on the RAG benchmark. Interestingly, we find that the state-of-the-art long-context retriever can alleviate the top-k context fragmentation issue in RAG, further improving RAG-based results for long-context understanding tasks. We also provide extensive comparisons between RAG and long-context solutions using state-of-the-art long-context LLMs.
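As a toy illustration of the fragmentation point raised in the abstract: under a fixed token budget, a retriever over short chunks must return many disconnected snippets, while a long-context retriever can return a few intact spans. The chunk sizes and budget below are invented for illustration, not taken from the paper.

```python
# Toy illustration of top-k context fragmentation: the same token
# budget split into many short chunks vs. a few long ones.
# Chunk sizes and the budget are made up, not the paper's settings.
def chunk(tokens: list[str], size: int) -> list[list[str]]:
    """Split a token sequence into consecutive chunks of `size`."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

document = [f"tok{i}" for i in range(12_000)]    # stand-in document
budget = 6_000                                    # tokens fed to the LLM

for size in (300, 1_500):                         # short vs. long chunks
    pieces = chunk(document, size)
    k = budget // size                            # top-k under the budget
    # Shorter chunks mean a larger k and more boundaries where related
    # sentences are split apart; longer chunks keep passages intact.
    print(f"chunk size {size}: top-{k} retrieval -> {k} separate fragments")
```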