
HoPE: Hybrid of Position Embedding for Length Generalization in Vision-Language Models

Haoran Li, Yingjie Qin, Baoyuan Ou, Lai Xu, Ruiwen Xu

2025-05-29

Summary

This paper introduces HoPE, a new position-embedding method that helps vision-language models understand and process long videos by improving how they keep track of the order and timing of events across video frames and accompanying text.

What's the problem?

The problem is that when a vision-language model tries to make sense of a long video, it often struggles to generalize beyond the video lengths it was trained on: it loses track of what happened earlier and fails to connect events that are far apart in time. This makes it hard for the model to answer questions about the video or summarize it, especially when the video is really long.

What's the solution?

The researchers created HoPE, which combines different ways of encoding where and when things appear in a video so the model can more easily follow the sequence of events. By allocating frequencies more carefully between spatial and temporal information and dynamically scaling how it handles time, the model can keep track of what is happening over longer stretches and do a better job of understanding long, complex videos.
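
To make the two ideas more concrete, here is a minimal sketch (not the authors' implementation) of a rotary-style position embedding whose frequency bands are split between spatial and temporal positions, with the lowest band left unrotated, and whose temporal (frame) indices are multiplied by an adjustable scale. The split ratios, the temporal_scale value, and the helper names are illustrative assumptions rather than the paper's actual design.

```python
# Sketch of hybrid frequency allocation + dynamic temporal scaling
# (illustrative assumptions only; not the paper's implementation).
import numpy as np

def hybrid_rope_angles(t_idx, x_idx, y_idx, head_dim=64, base=10000.0,
                       temporal_scale=1.0):
    """Return per-pair rotation angles for one token at frame t, position (x, y)."""
    num_pairs = head_dim // 2
    freqs = base ** (-np.arange(num_pairs) / num_pairs)  # high -> low frequency

    # Hybrid allocation (assumed split): the highest-frequency pairs encode
    # spatial x/y positions, the mid-band encodes (scaled) time, and the
    # lowest-frequency pairs get no positional rotation at all.
    n_spatial = num_pairs // 2   # assumption: half the pairs for x/y
    n_zero = num_pairs // 4      # assumption: lowest quarter left unrotated

    angles = np.zeros(num_pairs)
    angles[:n_spatial // 2] = x_idx * freqs[:n_spatial // 2]
    angles[n_spatial // 2:n_spatial] = y_idx * freqs[n_spatial // 2:n_spatial]
    # Dynamic temporal scaling: stretch or compress frame indices at inference.
    angles[n_spatial:num_pairs - n_zero] = (
        temporal_scale * t_idx * freqs[n_spatial:num_pairs - n_zero]
    )
    # angles[num_pairs - n_zero:] stay zero (no positional signal).
    return angles

def apply_rotary(q, angles):
    """Rotate consecutive (even, odd) feature pairs of q by the given angles."""
    q = q.reshape(-1, 2)
    cos, sin = np.cos(angles)[:, None], np.sin(angles)[:, None]
    rotated = np.concatenate(
        [q[:, :1] * cos - q[:, 1:] * sin,
         q[:, :1] * sin + q[:, 1:] * cos], axis=1)
    return rotated.reshape(-1)

# Example: the same query vector at frame 3 vs. frame 300, with frame indices
# compressed by a 0.5 scale to mimic adapting to a longer-than-trained video.
q = np.random.default_rng(0).standard_normal(64)
near = apply_rotary(q, hybrid_rope_angles(3, 4, 7, temporal_scale=0.5))
far = apply_rotary(q, hybrid_rope_angles(300, 4, 7, temporal_scale=0.5))
print(near[:4], far[:4])
```

The design choice the sketch is meant to convey: fast-changing (high-frequency) components stay reserved for fine-grained spatial positions, slower components carry temporal order with a scale that can be adjusted for longer videos, and the slowest components carry no positional signal, so semantic matching between distant frames is not distorted by position.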

Why it matters?

This is important because it allows AI to be more helpful in real-life situations, like analyzing security footage, making sense of movies, or helping people learn from educational videos. With better memory and understanding of long videos, AI becomes much more useful and reliable.

Abstract

HoPE, a Hybrid of Position Embedding, enhances VLMs' long-context performance in videos through improved frequency allocation and dynamic temporal scaling.