LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA
Jiajie Zhang, Yushi Bai, Xin Lv, Wanjun Gu, Danqing Liu, Minhao Zou, Shulin Cao, Lei Hou, Yuxiao Dong, Ling Feng, Juanzi Li
2024-09-05

Summary
This paper introduces LongCite, a method that enables large language models (LLMs) to generate answers with fine-grained, sentence-level citations, making it easier for users to verify the information they provide.
What's the problem?
Although current long-context LLMs can answer questions based on extensive text, their responses typically lack citations. Without citations, it is hard for users to check the accuracy of the information, which raises concerns about reliability because these models can hallucinate, i.e., make up information.
What's the solution?
To address this, the authors first introduce LongBench-Cite, a benchmark for evaluating how well LLMs answer long-context questions with citations. They then develop CoF (Coarse to Fine), a pipeline that uses off-the-shelf LLMs to automatically generate question-answer pairs with precise sentence-level citations, and use it to build LongCite-45k, a large-scale supervised fine-tuning dataset. Finally, they train two models, LongCite-8B and LongCite-9B, on this dataset so that the models produce accurate answers together with sentence-level citations in a single output.
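For intuition, here is a minimal sketch of the coarse-to-fine idea: first answer the question from the full context, then attach sentence-level citations to each statement. The prompts, the two-step split, and the `call_llm()` and `split_into_sentences()` helpers are illustrative assumptions, not the paper's exact CoF implementation.

```python
from typing import Dict, List


def call_llm(prompt: str) -> str:
    """Placeholder for a call to an off-the-shelf LLM (any chat-completion API)."""
    raise NotImplementedError


def split_into_sentences(text: str) -> List[str]:
    """Naive sentence splitter; a real pipeline would use a proper segmenter."""
    return [s.strip() for s in text.replace("\n", " ").split(". ") if s.strip()]


def coarse_to_fine_citations(context: str, question: str) -> Dict:
    # Coarse step: answer the question from the full context.
    answer = call_llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

    # Fine step: for each answer statement, ask the LLM which numbered context
    # sentences support it, yielding sentence-level citations.
    sentences = split_into_sentences(context)
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(sentences))
    citations = []
    for statement in split_into_sentences(answer):
        supporting = call_llm(
            f"Context sentences:\n{numbered}\n\n"
            f"Statement: {statement}\n"
            "Return the indices of the sentences that support this statement."
        )
        citations.append({"statement": statement, "supporting_sentences": supporting})
    return {"answer": answer, "citations": citations}
```

Instances produced this way can then be filtered and used as supervised fine-tuning data, which is the role LongCite-45k plays for the LongCite-8B and LongCite-9B models.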
Why it matters?
This research is important because it improves the trustworthiness of AI-generated responses by ensuring that users can easily verify the information provided. By enabling LLMs to include detailed citations, LongCite enhances the usability of these models in academic and professional settings where accuracy is crucial.
Abstract
Though current long-context large language models (LLMs) have demonstrated impressive capacities in answering user questions based on extensive text, the lack of citations in their responses makes user verification difficult, leading to concerns about their trustworthiness due to their potential hallucinations. In this work, we aim to enable long-context LLMs to generate responses with fine-grained sentence-level citations, improving their faithfulness and verifiability. We first introduce LongBench-Cite, an automated benchmark for assessing current LLMs' performance in Long-Context Question Answering with Citations (LQAC), revealing considerable room for improvement. To this end, we propose CoF (Coarse to Fine), a novel pipeline that utilizes off-the-shelf LLMs to automatically generate long-context QA instances with precise sentence-level citations, and leverage this pipeline to construct LongCite-45k, a large-scale SFT dataset for LQAC. Finally, we train LongCite-8B and LongCite-9B using the LongCite-45k dataset, successfully enabling their generation of accurate responses and fine-grained sentence-level citations in a single output. The evaluation results on LongBench-Cite show that our trained models achieve state-of-the-art citation quality, surpassing advanced proprietary models including GPT-4o.
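To make "accurate responses and fine-grained sentence-level citations in a single output" concrete, a cited response could be serialized as statements with inline references to the indices of supporting context sentences and parsed as below. The markup (the `<statement>`/`<cite>` tags and index ranges) and the example text are illustrative assumptions, not necessarily the exact format the trained models emit.

```python
import re

# Hypothetical single-output response: each statement carries an inline citation
# pointing to the context sentences that support it.
response = (
    "<statement>The company was founded in 1998.<cite>[12-13]</cite></statement>"
    "<statement>It went public three years later.<cite>[27-27]</cite></statement>"
)

# Extract (statement, start index, end index) triples for verification.
pattern = re.compile(r"<statement>(.*?)<cite>\[(\d+)-(\d+)\]</cite></statement>")
for text, start, end in pattern.findall(response):
    print(f"{text}  -> supported by context sentences {start}..{end}")
```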