Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models

Guo Chen, Zhiqi Li, Shihao Wang, Jindong Jiang, Yicheng Liu, Lidong Lu, De-An Huang, Wonmin Byeon, Matthieu Le, Tuomas Rintamaki, Tyler Poon, Max Ehrlich, Tong Lu, Limin Wang, Bryan Catanzaro, Jan Kautz, Andrew Tao, Zhiding Yu, Guilin Liu

2025-04-22

Summary

This paper introduces Eagle 2.5, a new vision-language model that can understand long videos and high-resolution images, using smart training techniques to keep its performance high even when handling lots of information at once.

What's the problem?

The problem is that most AI models struggle with long videos and very detailed images: they either lose important context or need huge amounts of computing power, which makes them slow and expensive to use.

What's the solution?

The researchers created Eagle 2.5, which uses two techniques called Automatic Degrade Sampling and Image Area Preservation to keep important context and visual details intact during training. They also built Eagle-Video-110K, a new dataset of long videos, which helps the model learn both overall storylines and small details. Together, these let Eagle 2.5 match or beat much larger models on tough benchmarks while staying efficient and not needing massive computing resources.
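To make those two ideas a bit more concrete, here is a minimal sketch in Python. This is not the authors' implementation; the tile size, token budget, and function names are illustrative assumptions, and the real training pipeline is more involved.

```python
# Illustrative sketch only: the tile size, token budget, and function
# names are assumptions, not the Eagle 2.5 codebase.

def degrade_sample(num_frames: int, tokens_per_frame: int, budget: int) -> int:
    """Automatic Degrade Sampling, roughly: if a full video does not fit
    in the model's context budget, sample frames more sparsely across the
    whole video instead of cutting off the end, so the storyline survives."""
    stride = 1
    while (num_frames // stride) * tokens_per_frame > budget:
        stride += 1
    return num_frames // stride  # number of frames actually kept

def area_preserving_grid(width: int, height: int,
                         tile: int = 448, max_tiles: int = 12) -> tuple[int, int]:
    """Image Area Preservation, roughly: split a high-resolution image into
    a grid of tiles whose shape matches the image's aspect ratio and keeps
    as much of the original area as possible, rather than squashing it
    into one small square."""
    aspect = width / height
    best, best_score = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            # Penalize aspect-ratio distortion and lost resolution.
            area_kept = min(1.0, (cols * rows * tile * tile) / (width * height))
            score = abs(cols / rows - aspect) + (1.0 - area_kept)
            if score < best_score:
                best, best_score = (cols, rows), score
    return best

# Example: a 512-frame video squeezed into a 32k-token budget, and a
# 1080p image split into a grid that roughly matches its wide shape.
print(degrade_sample(num_frames=512, tokens_per_frame=256, budget=32768))
print(area_preserving_grid(1920, 1080))
```

The point of both helpers is the same trade-off the paper describes: when something has to give, degrade gracefully (fewer frames, coarser tiles) rather than throwing away whole regions of the video or image.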

Why it matters?

This matters because it means powerful AI for video and image understanding can now be used on regular hardware, making advanced technology more accessible for things like video analysis, education, and creative projects.

Abstract

Eagle 2.5 improves long-context multimodal learning through Automatic Degrade Sampling and Image Area Preservation techniques, enhancing VLMs for video and image understanding, and matches state-of-the-art performance on benchmarks.