EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang
2025-03-10
Summary
This paper introduces EAGLE-3, a new method that makes large language models (LLMs) run faster without losing output quality, improving on earlier techniques such as EAGLE and speculative sampling.
What's the problem?
Large language models are powerful but slow and expensive to run. While methods like EAGLE have sped up inference, they do not fully benefit from more training data, which is a common way to make AI models smarter.
What's the solution?
The researchers created EAGLE-3, which changes how the draft model predicts what comes next in a sentence. Instead of predicting internal features, it directly predicts tokens (words). It also uses information from multiple layers of the target model, not just the top layer. These changes allow EAGLE-3 to take full advantage of more training data, making it faster and more accurate.
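The two changes can be pictured with a toy sketch: fuse hidden states from several layers into one vector, then map that vector straight to vocabulary logits rather than to a predicted feature vector. All sizes, the choice of layers, and the random projection matrices below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, vocab = 8, 16  # toy sizes, chosen for illustration only

# Hidden states taken from a low, middle, and high layer of the target model
# (assumed stand-ins; a real model would supply these during decoding).
low, mid, high = (rng.standard_normal(hidden) for _ in range(3))

# Multi-layer fusion (sketch): concatenate the three layers' features and
# project back down to the hidden size for the draft model to consume.
W_fuse = rng.standard_normal((hidden, 3 * hidden))
fused = W_fuse @ np.concatenate([low, mid, high])

# Direct token prediction (sketch): map fused features straight to vocabulary
# logits, instead of predicting the next top-layer feature as EAGLE did.
W_head = rng.standard_normal((vocab, hidden))
logits = W_head @ fused
draft_token = int(np.argmax(logits))
```

The key contrast with the original EAGLE is the output target: predicting a feature vector forces the draft model to match a moving intermediate representation, whereas predicting tokens directly trains it on the same objective the target model is judged by.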
Why it matters?
This matters because it makes LLM inference up to 6.5 times faster than standard decoding and about 1.4 times faster than the previous version, EAGLE-2. Faster models can be deployed more widely and cheaply, potentially improving chatbots, translation services, and other language-based AI tools. And because the code is available online, other researchers can use and build on this technique.
Abstract
The sequential nature of modern LLMs makes them expensive and slow, and speculative sampling has proven to be an effective solution to this problem. Methods like EAGLE perform autoregression at the feature level, reusing top-layer features from the target model to achieve better results than vanilla speculative sampling. A growing trend in the LLM community is scaling up training data to improve model intelligence without increasing inference costs. However, we observe that scaling up data provides limited improvements for EAGLE. We identify that this limitation arises from EAGLE's feature prediction constraints. In this paper, we introduce EAGLE-3, which abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion via a technique named training-time test. These improvements significantly enhance performance and enable the draft model to fully benefit from scaling up training data. Our experiments include both chat models and reasoning models, evaluated on five tasks. The results show that EAGLE-3 achieves a speedup ratio up to 6.5x, with about 1.4x improvement over EAGLE-2. The code is available at https://github.com/SafeAILab/EAGLE.
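The speculative sampling loop the abstract builds on can be sketched in a few lines: a cheap draft model proposes several tokens, and the target model accepts each with probability min(1, p/q), resampling from the residual distribution on rejection, which provably preserves the target's output distribution. The toy distributions, vocabulary size, and function names below are assumptions for illustration; real systems score the whole draft with one target-model forward pass.

```python
import random

random.seed(0)
VOCAB = 4

def draft_dist(ctx):
    # Cheap draft model (toy, assumed): a fixed skewed distribution.
    return [0.4, 0.3, 0.2, 0.1]

def target_dist(ctx):
    # Expensive target model (toy, assumed): a fixed uniform distribution.
    return [0.25, 0.25, 0.25, 0.25]

def speculative_step(ctx, k=3):
    """Draft up to k tokens, then accept or reject each against the target.

    Accepting token t with probability min(1, p[t] / q[t]) and resampling
    rejections from the renormalized residual max(0, p - q) guarantees the
    emitted tokens follow the target distribution exactly.
    """
    out = list(ctx)
    for _ in range(k):
        q = draft_dist(out)
        t = random.choices(range(VOCAB), weights=q)[0]
        p = target_dist(out)
        if random.random() < min(1.0, p[t] / q[t]):
            out.append(t)  # accepted: a token gained without a full target step
        else:
            # Rejected: resample from the residual distribution, then stop.
            resid = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            total = sum(resid)
            out.append(random.choices(range(VOCAB),
                                      weights=[r / total for r in resid])[0])
            break
    return out
```

The speedup comes from the accepted run length: every accepted draft token is a target-quality token produced without a separate sequential target-model step, which is why a more accurate draft model (the goal of EAGLE-3) translates directly into a higher speedup ratio.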