Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM's Nest

Letian Peng, Zilong Wang, Feng Yao, Jingbo Shang

2025-02-18

Summary

This paper introduces Cuckoo, a new AI system for information extraction (IE), the task of pulling specific details out of text. Cuckoo repurposes the training data of large language models (LLMs) to improve its IE ability without needing additional manually labeled training data.

What's the problem?

Many current methods for information extraction rely on large amounts of specially labeled data, which is hard to create and scale. This makes it difficult to train IE systems to handle a wide range of tasks effectively. Additionally, existing IE models often can't take full advantage of the massive datasets used to train LLMs.

What's the solution?

The researchers created Cuckoo, an IE model trained with a method called next-token extraction (NTE). Instead of generating the next word the way an LLM does, NTE reframes the prediction target as tagging tokens that already appear in the input text. By converting data already used for LLM pre-training and post-training into extraction examples, Cuckoo avoids the need for expensive manual labeling. It can adapt to both simple and complex tasks by leveraging the vast resources of LLMs, making it more efficient and scalable.
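To make the NTE idea concrete, here is a minimal sketch of how a "next tokens" answer can be turned into BIO tags over the context (the tagging scheme the abstract mentions). The function name and the example sentence are illustrative, not taken from the paper's codebase.

```python
def nte_labels(context_tokens, answer_tokens):
    """Tag each context token B/I/O depending on whether it begins or
    continues an occurrence of the answer span inside the context.
    If the answer does not appear in the context, everything stays O."""
    labels = ["O"] * len(context_tokens)
    n = len(answer_tokens)
    for i in range(len(context_tokens) - n + 1):
        if context_tokens[i:i + n] == answer_tokens:
            labels[i] = "B"                      # span begins here
            for j in range(i + 1, i + n):
                labels[j] = "I"                  # span continues
    return labels

# Toy example: the "next tokens" an LLM would generate ("Paris")
# already occur in the context, so they become an extraction target.
context = "Paris is the capital of France".split()
answer = ["Paris"]
print(list(zip(context, nte_labels(context, answer))))
# [('Paris', 'B'), ('is', 'O'), ('the', 'O'), ('capital', 'O'),
#  ('of', 'O'), ('France', 'O')]
```

This is the sense in which the model "free rides": any LLM training pair whose answer tokens already appear in the context can be relabeled this way, with no new human annotation.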

Why it matters?

This matters because it shows how existing AI training data can be repurposed to improve specialized tasks like information extraction without extra annotation effort or resources. Because Cuckoo can evolve alongside advancements in LLM data preparation, it could make it easier and cheaper to build systems that extract useful information from text, benefiting fields like research, business, and education.

Abstract

Massive high-quality data, both pre-training raw texts and post-training annotations, have been carefully prepared to incubate advanced large language models (LLMs). In contrast, for information extraction (IE), pre-training data, such as BIO-tagged sequences, are hard to scale up. We show that IE models can act as free riders on LLM resources by reframing next-token prediction into extraction for tokens already present in the context. Specifically, our proposed next tokens extraction (NTE) paradigm learns a versatile IE model, Cuckoo, with 102.6M extractive data converted from LLM's pre-training and post-training data. Under the few-shot setting, Cuckoo adapts effectively to traditional and complex instruction-following IE with better performance than existing pre-trained IE models. As a free rider, Cuckoo can naturally evolve with the ongoing advancements in LLM data preparation, benefiting from improvements in LLM training pipelines without additional manual effort.