Draft-based Approximate Inference for LLMs

Kevin Galim, Ethan Ewer, Wonjun Kang, Minjae Lee, Hyung Il Koo, Kangwook Lee

2025-06-15

Summary

This paper presents a new way to make large language models faster and more memory-efficient when they must understand or use very long pieces of text. It introduces a framework that uses a smaller, cheaper 'draft' model to predict which parts of the text and which pieces of the model's internal state are most important, so the bigger model can focus only on the key parts.

What's the problem?

The problem is that when large language models process long texts, they need a lot of memory and time because the attention mechanism compares every token with every other token, a cost that grows roughly quadratically as the text gets longer. Existing shortcuts that skip or discard parts of this computation often guess incorrectly about which parts matter, which lowers the accuracy of the model's output.

What's the solution?

The solution is to run a small draft model first to produce a rough preview of which parts of the input matter before the main model processes everything fully. Guided by the draft model's predictions, the system can drop less important information far more accurately, reducing memory use and speeding up inference without losing much accuracy. The authors develop two methods on top of this idea: one decides which internal key-value pairs to keep in the cache, and the other compresses the prompt itself by dropping unneeded tokens.
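The core idea above can be sketched in a few lines: score each prompt token by how much attention it receives inside the small draft model, then keep only the top-scoring tokens for the large model. This is an illustrative simplification, not the paper's exact algorithm; the function name, the sum-of-attention scoring, and the `keep_ratio` parameter are assumptions made for this sketch.

```python
import numpy as np

def draft_guided_keep_indices(draft_attention, keep_ratio=0.25):
    """Pick the prompt tokens to keep, using attention weights from a
    small draft model as an importance signal.

    draft_attention: array of shape (num_queries, num_tokens) holding
    the draft model's attention weights (each row sums to 1).
    Returns the indices of the kept tokens, in original order.
    """
    # Total attention each token receives across all query positions.
    importance = draft_attention.sum(axis=0)
    # Keep the top fraction of tokens (at least one).
    k = max(1, int(len(importance) * keep_ratio))
    top_k = np.argsort(importance)[-k:]
    # Sort so the surviving tokens stay in their original text order.
    return np.sort(top_k)

# Toy example: 8 prompt tokens, 2 query positions in the draft model.
rng = np.random.default_rng(0)
attn = rng.random((2, 8))
attn /= attn.sum(axis=1, keepdims=True)  # normalize rows like softmax output
kept = draft_guided_keep_indices(attn, keep_ratio=0.5)
print(kept)  # indices of the 4 retained tokens
```

The same scoring idea applies to both of the paper's settings: for prompt compression the dropped tokens are removed from the input, and for cache management the corresponding key-value pairs are evicted instead.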

Why it matters?

This matters because it lets large language models handle very long texts more efficiently, saving time and compute while keeping answers accurate. That makes advanced AI tools more practical for tasks that involve reading or writing long documents, and more accessible to users with limited hardware.

Abstract

A new framework using draft models enhances approximate inference for long-context LLMs by better predicting token and key-value pair importance, improving accuracy while maintaining memory and compute efficiency.