VideoRoPE: What Makes for Good Video Rotary Position Embedding?

Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, Dahua Lin

2025-02-10

Summary

This paper asks what makes a good rotary position embedding (RoPE) for video. RoPE is the positional trick many language models use to tell the model where each token sits in a sequence, but video has both space and time, so the usual 1D version does not carry over cleanly. The authors analyze what a video-friendly RoPE needs and propose VideoRoPE, a 3D version that consistently beats earlier variants on video tasks.

What's the problem?

RoPE works well for long text, but video has a complex spatio-temporal structure: every patch has a horizontal position, a vertical position, and a frame (time) position. Earlier attempts to stretch 1D RoPE over video did not fully account for this. The authors expose the failure with a new test, V-NIAH-D (Visual Needle-In-A-Haystack with Distractors), which inserts periodic distractor frames into the V-NIAH retrieval task. RoPE variants that lack an appropriate allocation for the temporal dimension are easily misled by these distractors and retrieve the wrong content.

What's the solution?

Guided by four characteristics their analysis identifies as essential, the team designed VideoRoPE with a 3D structure that preserves spatio-temporal relationships. It allocates the low-frequency rotary channels to the time axis to mitigate periodic oscillations, uses a diagonal layout to maintain spatial symmetry between the two image axes, and adds adjustable temporal spacing so that temporal and spatial indexing are decoupled.

Why it matters?

Position embeddings quietly determine how well video models handle long inputs. VideoRoPE consistently outperforms previous RoPE variants on long video retrieval, video understanding, and video hallucination benchmarks, which makes long-video assistants more reliable in practice. The V-NIAH-D task also gives the community a sharper stress test for future position embedding designs.

Abstract

While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of the 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully considered in prior work. As part of our analysis, we introduce a challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task, which adds periodic distractors into V-NIAH. The V-NIAH-D task demonstrates that previous RoPE variants, lacking appropriate temporal dimension allocation, are easily misled by distractors. Based on our analysis, we introduce VideoRoPE, with a 3D structure designed to preserve spatio-temporal relationships. VideoRoPE features low-frequency temporal allocation to mitigate periodic oscillations, a diagonal layout to maintain spatial symmetry, and adjustable temporal spacing to decouple temporal and spatial indexing. VideoRoPE consistently surpasses previous RoPE variants, across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination. Our code will be available at https://github.com/Wiselnn570/VideoRoPE.
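To make the three design ideas in the abstract concrete, here is a minimal Python sketch of what they could look like: giving the time axis the lowest-frequency rotary channels, laying spatial indices out along a diagonal, and scaling the temporal index by an adjustable spacing factor. The function names, the exact channel split, and the centering formula are assumptions for illustration, not the paper's official implementation.

```python
import numpy as np

def rope_inverse_freqs(head_dim, base=10000.0):
    """Standard RoPE inverse frequencies, one per rotary pair,
    ordered from highest to lowest frequency."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def allocate_freqs(head_dim, temporal_pairs=8):
    """Low-frequency temporal allocation (illustrative split): give time
    the slowest-oscillating rotary pairs so periodic patterns along the
    temporal axis are damped; split the rest between the two spatial axes."""
    freqs = rope_inverse_freqs(head_dim)
    temporal = freqs[-temporal_pairs:]        # lowest frequencies -> time
    spatial = freqs[:-temporal_pairs]
    return temporal, spatial[::2], spatial[1::2]

def video_positions(num_frames, height, width, delta=2.0):
    """Assign a (t, x, y) position index to every video patch.

    - `delta` scales the temporal index (adjustable temporal spacing),
      decoupling temporal from spatial indexing.
    - Spatial indices are offset by the frame's temporal index and
      centered, so frames line up along a diagonal (spatial symmetry).
    """
    positions = []
    for t in range(num_frames):
        tau = delta * t  # scaled temporal index
        for y in range(height):
            for x in range(width):
                positions.append((tau,
                                  tau + x - (width - 1) / 2,
                                  tau + y - (height - 1) / 2))
    return positions
```

In this sketch, each (t, x, y) triple would then drive three independent rotary rotations using the corresponding frequency group, rather than a single 1D position as in text-only RoPE.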