LongGenBench: Long-context Generation Benchmark

Xiang Liu, Peijie Dong, Xuming Hu, Xiaowen Chu

2024-10-09

Summary

This paper introduces LongGenBench, a new benchmark designed to evaluate how well large language models (LLMs) can generate long pieces of text while following specific instructions, filling a gap in existing tests.

What's the problem?

Most current benchmarks for evaluating LLMs on long contexts focus on retrieving information from long inputs rather than generating new, coherent text. As a result, there are few tests of how well these models can produce long-form content that stays coherent and follows detailed instructions, which matters for real-world applications like writing and design.

What's the solution?

The authors created LongGenBench, a synthetic benchmark that allows flexible configuration of how much text a model must generate. It packs multiple sub-questions into a single prompt, so the model must produce one cohesive, long-form answer that follows complex instructions across a lengthy context. They evaluated several state-of-the-art LLMs with this benchmark and found that, although these models perform well on retrieval tasks, their accuracy degrades on long-context generation, and the drop grows as the required output length increases.
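To make the question-packing idea concrete, here is a minimal sketch of how such a prompt could be assembled. The helper name, instruction wording, and example tasks are illustrative assumptions, not the authors' actual templates:

```python
# Hypothetical sketch of a LongGenBench-style prompt builder.
# All names and wording here are assumptions for illustration only.

def build_long_generation_prompt(subtasks: list[str]) -> str:
    """Pack many sub-questions into one prompt so the model must answer
    all of them in a single, cohesive long-form response."""
    header = (
        "Answer every question below, in order, in one continuous response. "
        "Keep the answers consistent with each other.\n\n"
    )
    body = "\n".join(f"Q{i + 1}. {q}" for i, q in enumerate(subtasks))
    return header + body

# Increasing the number of sub-questions increases the required
# generation length, which is the axis the benchmark varies.
questions = [f"Describe step {i} of the procedure in detail." for i in range(1, 41)]
prompt = build_long_generation_prompt(questions)
```

Because every sub-question lives in one prompt and must be answered in one pass, the model cannot fall back on short, independent completions; it has to sustain coherence across the whole output.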

Why it matters?

This research matters because it highlights the limitations of current LLMs in generating high-quality long-form text. By introducing LongGenBench, the authors aim to improve how these models are evaluated and encourage advancements in their ability to produce coherent and contextually relevant content for practical uses like creative writing and technical documentation.

Abstract

Current long-context benchmarks primarily focus on retrieval-based tests, requiring Large Language Models (LLMs) to locate specific information within extensive input contexts, such as the needle-in-a-haystack (NIAH) benchmark. Long-context generation refers to the ability of a language model to generate coherent and contextually accurate text that spans across lengthy passages or documents. While recent studies show strong performance on NIAH and other retrieval-based long-context benchmarks, there is a significant lack of benchmarks for evaluating long-context generation capabilities. To bridge this gap and offer a comprehensive assessment, we introduce a synthetic benchmark, LongGenBench, which allows for flexible configurations of customized generation context lengths. LongGenBench advances beyond traditional benchmarks by redesigning the format of questions and necessitating that LLMs respond with a single, cohesive long-context answer. Upon extensive evaluation using LongGenBench, we observe that: (1) both API-accessed and open-source models exhibit performance degradation in long-context generation scenarios, ranging from 1.2% to 47.1%; (2) different series of LLMs exhibit varying trends of performance degradation, with the Gemini-1.5-Flash model showing the least degradation among API-accessed models, and the Qwen2 series exhibiting the least degradation in LongGenBench among open-source models.
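To illustrate what a degradation percentage like those quoted above could mean, here is a small sketch of one plausible way to compute a relative accuracy drop. The formula and example numbers are assumptions for illustration, not the paper's exact metric or results:

```python
# Illustrative degradation metric; the relative-drop formula below is an
# assumption about how a percentage figure could be computed.

def degradation_pct(baseline_accuracy: float, longgen_accuracy: float) -> float:
    """Relative drop in accuracy when moving from a baseline setting
    to the long-context generation setting, as a percentage."""
    return 100.0 * (baseline_accuracy - longgen_accuracy) / baseline_accuracy

# Hypothetical example: a model scoring 0.90 on short answers but 0.60
# when it must generate one long, cohesive response.
print(f"{degradation_pct(0.90, 0.60):.1f}% degradation")  # 33.3% degradation
```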