L-CiteEval: Do Long-Context Models Truly Leverage Context for Responding?
Zecheng Tang, Keyan Zhou, Juntao Li, Baibei Ji, Jianye Hou, Min Zhang
2024-10-04

Summary
This paper introduces L-CiteEval, a new benchmark designed to evaluate how well long-context models (LCMs) understand and use lengthy information when responding to questions.
What's the problem?
Long-context models have improved at tasks that require processing large amounts of information, such as summarizing documents. However, accuracy alone isn't enough: when the context is very long, it's hard for people to verify whether a model's answer is actually grounded in that context. Existing evaluations either focus on specific tasks or depend on external resources like GPT-4, so they may not give a complete picture of how well the models are really using the context.
What's the solution?
To address this, the authors developed L-CiteEval, a benchmark of 11 tasks that test LCMs across diverse domains and context lengths from 8,000 to 48,000 tokens. It evaluates not only how well the models generate answers but also how accurately they provide supporting evidence (citations) for those answers. By testing both closed-source and open-source LCMs, the authors found that open-source models often struggle to use the given context effectively, relying more on their built-in knowledge instead.
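
To make the citation-quality idea concrete, below is a minimal sketch of how citation precision and recall could be scored from gold and predicted citation indices. The function names and data layout here are illustrative assumptions, not the paper's official L-CiteEval evaluation suite.

```python
# Minimal sketch (not the official L-CiteEval suite) of citation-quality scoring.
# Assumption: each answer statement comes with the set of context-chunk IDs the
# model cited and the set of gold (ground-truth) supporting chunk IDs.

def citation_scores(predicted: set[int], gold: set[int]) -> tuple[float, float]:
    """Return (precision, recall) for one statement's citations."""
    precision = len(predicted & gold) / len(predicted) if predicted else 0.0
    recall = len(predicted & gold) / len(gold) if gold else 0.0
    return precision, recall


def benchmark_citation_quality(examples: list[dict]) -> dict[str, float]:
    """Average citation precision/recall over a list of examples.

    Each example is assumed to look like: {"cited": {3, 7}, "gold": {3, 5}}
    """
    precisions, recalls = [], []
    for ex in examples:
        p, r = citation_scores(set(ex["cited"]), set(ex["gold"]))
        precisions.append(p)
        recalls.append(r)
    n = max(len(examples), 1)
    return {
        "citation_precision": sum(precisions) / n,
        "citation_recall": sum(recalls) / n,
    }


if __name__ == "__main__":
    demo = [
        {"cited": {1, 4}, "gold": {1, 2}},  # one correct, one spurious citation
        {"cited": {3}, "gold": {3}},        # perfect citation
    ]
    print(benchmark_citation_quality(demo))
```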
Why it matters?
This research is important because it provides a comprehensive way to assess the capabilities of long-context models in real-world applications. By focusing on both understanding and faithfulness in responses, L-CiteEval helps ensure that these models can be trusted to provide accurate and relevant information, which is crucial for tasks like research and information retrieval.
Abstract
Long-context models (LCMs) have made remarkable strides in recent years, offering users great convenience for handling tasks that involve long context, such as document summarization. As the community increasingly prioritizes the faithfulness of generated results, merely ensuring the accuracy of LCM outputs is insufficient, as it is quite challenging for humans to verify the results from the extremely lengthy context. Yet, although some efforts have been made to assess whether LCMs respond truly based on the context, these works either are limited to specific tasks or heavily rely on external evaluation resources like GPT-4. In this work, we introduce L-CiteEval, a comprehensive multi-task benchmark for long-context understanding with citations, aiming to evaluate both the understanding capability and faithfulness of LCMs. L-CiteEval covers 11 tasks from diverse domains, spanning context lengths from 8K to 48K, and provides a fully automated evaluation suite. Through testing with 11 cutting-edge closed-source and open-source LCMs, we find that although these models show minor differences in their generated results, open-source models substantially trail behind their closed-source counterparts in terms of citation accuracy and recall. This suggests that current open-source LCMs are prone to responding based on their inherent knowledge rather than the given context, posing a significant risk to the user experience in practical applications. We also evaluate the RAG approach and observe that RAG can significantly improve the faithfulness of LCMs, albeit with a slight decrease in the generation quality. Furthermore, we discover a correlation between the attention mechanisms of LCMs and the citation generation process.