LongKey: Keyphrase Extraction for Long Documents

Jeovane Honorio Alves, Radu State, Cinthia Obladen de Almendra Freitas, Jean Paul Barddal

2024-11-29

Summary

This paper presents LongKey, a new framework designed to automatically extract keyphrases from long documents, improving how we identify important terms in lengthy texts.

What's the problem?

As the volume of available information grows, manually finding and highlighting key terms in long documents becomes impractical. Most existing keyphrase extraction methods target short texts (up to 512 tokens), which leaves a gap for longer documents that contain more complex information.

What's the solution?

LongKey addresses this gap with an encoder-based language model that can process lengthy texts. It uses a max-pooling embedder to build a stronger representation of each keyphrase candidate. The framework was validated on the LDKP datasets and on six diverse, unseen datasets, where it consistently outperformed existing unsupervised and language model-based keyphrase extraction methods across varied text lengths and domains.
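
The summary does not spell out how the max-pooling embedder works. One plausible reading, assumed here purely for illustration, is that a candidate phrase appearing several times in a document gets one embedding per occurrence, and these are merged by taking the element-wise maximum. The sketch below shows that idea on toy vectors. The function names (`max_pool`, `pool_candidates`), the lowercase grouping, and the numbers are all hypothetical and are not taken from the LongKey implementation.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def max_pool(vectors: Sequence[Vector]) -> Vector:
    """Element-wise maximum over a non-empty list of equal-length vectors."""
    return [max(dims) for dims in zip(*vectors)]


def pool_candidates(occurrences: Sequence[Tuple[str, Vector]]) -> Dict[str, Vector]:
    """Group occurrence embeddings by candidate phrase, then max-pool each group.

    Assumption: occurrences of the same phrase are matched case-insensitively.
    """
    grouped: Dict[str, List[Vector]] = defaultdict(list)
    for phrase, embedding in occurrences:
        grouped[phrase.lower()].append(embedding)
    return {phrase: max_pool(vecs) for phrase, vecs in grouped.items()}


# Toy occurrence embeddings (made-up values, 3 dimensions for readability).
occurrences = [
    ("keyphrase extraction", [0.2, 0.9, -0.1]),
    ("long documents", [0.5, 0.1, 0.3]),
    ("Keyphrase Extraction", [0.7, 0.4, -0.3]),
]

pooled = pool_candidates(occurrences)
print(pooled["keyphrase extraction"])  # [0.7, 0.9, -0.1]
```

Under this reading, each dimension of the pooled vector keeps the strongest signal seen in any occurrence, so a phrase that is salient in only one part of a long document is not diluted the way it would be under averaging.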

Why does it matter?

This research is important because it helps automate the process of identifying key information in long documents, which can save time and improve efficiency in fields like research, education, and business. By enhancing how we extract keyphrases, LongKey can help people quickly find relevant information in large amounts of text, making it easier to understand complex topics.

Abstract

In an era of information overload, manually annotating the vast and growing corpus of documents and scholarly papers is increasingly impractical. Automated keyphrase extraction addresses this challenge by identifying representative terms within texts. However, most existing methods focus on short documents (up to 512 tokens), leaving a gap in processing long-context documents. In this paper, we introduce LongKey, a novel framework for extracting keyphrases from lengthy documents, which uses an encoder-based language model to capture extended text intricacies. LongKey uses a max-pooling embedder to enhance keyphrase candidate representation. Validated on the comprehensive LDKP datasets and six diverse, unseen datasets, LongKey consistently outperforms existing unsupervised and language model-based keyphrase extraction methods. Our findings demonstrate LongKey's versatility and superior performance, marking an advancement in keyphrase extraction for varied text lengths and domains.
