LongIns: A Challenging Long-context Instruction-based Exam for LLMs
Shawn Gavin, Tuney Zheng, Jiaheng Liu, Quehry Que, Noah Wang, Jian Yang, Chenchen Zhang, Wenhao Huang, Wenhu Chen, Ge Zhang
2024-06-26

Summary
This paper introduces LongIns, a new benchmark designed to test how well large language models (LLMs) can handle long contexts when following instructions. It aims to provide a better understanding of LLM capabilities beyond just retrieving information.
What's the problem?
Most existing benchmarks for evaluating LLMs focus on how well a model can locate key information in a text, which mainly tests retrieval rather than reasoning. Additionally, while LLMs are advertised as supporting very long context windows (32k, 128k, or even 200k tokens), these benchmarks rarely reveal how well models actually perform at such lengths, leaving their true effective context length unclear.
What's the solution?
To tackle these issues, the authors built the LongIns benchmark from existing instruction datasets, so that its tasks require reasoning over long contexts rather than simple lookup. They define three evaluation settings: Global Instruction & Single Task (GIST), Local Instruction & Single Task (LIST), and Local Instruction & Multiple Tasks (LIMT); together these allow a more comprehensive assessment of LLM performance across scenarios (see the sketch below). Evaluating a range of LLMs on LongIns, the authors found that even the top-performing GPT-4, despite a claimed 128k context window, performs poorly at an evaluation window of only 16k, and that multi-hop reasoning remains weak even at short context lengths.
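The paper does not include code here, but the three settings can be pictured as different ways of assembling instructions and task examples into one long prompt. The sketch below is a minimal illustration under that assumption; the names (Task, build_gist_prompt, build_list_prompt, build_limt_prompt) are hypothetical and are not taken from the LongIns release.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Task:
    instruction: str     # task description, e.g. "Classify the sentiment of each sentence."
    examples: List[str]  # task examples concatenated into the long context

def build_gist_prompt(task: Task) -> str:
    # GIST: one global instruction stated once, followed by all examples of a single task.
    return task.instruction + "\n\n" + "\n".join(task.examples)

def build_list_prompt(task: Task) -> str:
    # LIST: the local instruction is restated before every example of a single task.
    return "\n\n".join(f"{task.instruction}\n{ex}" for ex in task.examples)

def build_limt_prompt(tasks: List[Task]) -> str:
    # LIMT: examples from multiple different tasks are mixed into one context,
    # each preceded by its own local instruction.
    segments = []
    for task in tasks:
        for ex in task.examples:
            segments.append(f"{task.instruction}\n{ex}")
    return "\n\n".join(segments)
```

In this reading, GIST stresses remembering a single far-away instruction, LIST repeats the instruction locally, and LIMT additionally forces the model to keep several different instructions straight within one long context.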
Why it matters?
This research is important because it helps clarify the actual capabilities of LLMs when dealing with long texts. By providing a more rigorous way to evaluate how these models understand and reason with extended information, LongIns can guide future improvements in LLM design and training, ultimately leading to more effective AI systems for various applications.
Abstract
The long-context capabilities of large language models (LLMs) have been a hot topic in recent years, and various benchmarks have emerged to evaluate LLM performance in different scenarios. However, most of these benchmarks focus on identifying key information to answer questions, which mainly requires the retrieval ability of LLMs, so they only partially reflect how well LLMs can reason over large amounts of information. Meanwhile, although LLMs often claim context windows of 32k, 128k, 200k, or even longer, these benchmarks fail to reveal the actual supported length of these LLMs. To address these issues, we propose the LongIns benchmark dataset, a challenging long-context instruction-based exam for LLMs built on existing instruction datasets. Specifically, LongIns introduces three evaluation settings: Global Instruction & Single Task (GIST), Local Instruction & Single Task (LIST), and Local Instruction & Multiple Tasks (LIMT). Based on LongIns, we perform comprehensive evaluations of existing LLMs and report the following key findings: (1) the top-performing GPT-4 with a 128k context length performs poorly at an evaluation context window of 16k in our LongIns; (2) the multi-hop reasoning ability of many existing LLMs still needs significant improvement, even under short context windows (less than 4k).