
Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout

Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Pinar Yanardag

2025-12-02

Summary

This paper introduces a new method, called ∞-RoPE (Infinity-RoPE), that improves how AI generates long videos. It tackles three limits of existing video generation models: how long a video can be, how quickly the model responds to new instructions mid-generation, and whether it can produce deliberate scene cuts within a single continuous video.

What's the problem?

Current AI video generators struggle with a few key things. First, they have a hard length limit: the positional encoding that tracks time in the video only covers a fixed number of frames, so generation cannot realistically extend past that horizon. Second, they are slow to react when instructions change during generation, making it hard to precisely control the action. Finally, they cannot easily create deliberate cuts or transitions between different scenes within a single, continuous video.

What's the solution?

The researchers developed ∞-RoPE, a set of three techniques that work together at inference time. Block-Relativistic RoPE changes how the model encodes time: the newest block of frames is always positioned at the end of the model's temporal window, with earlier frames shifted backward to keep their relative timing, so videos of any length can be generated (see the sketch below). KV Flush lets the model respond to new instructions immediately by clearing its memory down to just a global anchor frame and the most recent frame. RoPE Cut introduces intentional jumps in the time encoding, enabling scene transitions. Importantly, none of these techniques requires retraining the original model.
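
To make the moving-reference-frame idea concrete, here is a minimal sketch of how Block-Relativistic RoPE could assign temporal positions. This is an illustrative reconstruction, not the authors' code: `block_size`, `max_frames`, and the helper names are assumptions, and a real model would feed these angles into the rotary embedding inside its attention layers.

```python
import torch

def rotary_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE angles: one rotation frequency per pair of channels."""
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return positions[:, None].float() * freqs[None, :]  # (num_frames, dim // 2)

def block_relativistic_positions(num_generated: int, block_size: int, max_frames: int) -> torch.Tensor:
    """Assign temporal positions in a moving local reference frame.

    The newest latent block is anchored at the end of the base model's
    positional horizon, and every earlier frame is shifted backward by the
    same amount. Relative temporal distances are preserved, but no frame is
    ever assigned a position beyond what the base model was trained on.
    """
    absolute = torch.arange(num_generated)                     # 0 .. num_generated-1
    block_end = ((num_generated - 1) // block_size + 1) * block_size
    shift = max(block_end - max_frames, 0)                     # 0 until we pass the horizon
    return absolute - shift                                    # oldest frames may go negative

# Example: 48 frames generated in blocks of 8 with a 32-frame horizon.
# The newest block lands at positions 24..31; older frames slide backward.
pos = block_relativistic_positions(num_generated=48, block_size=8, max_frames=32)
print(pos[-8:])                      # tensor([24, 25, 26, 27, 28, 29, 30, 31])
angles = rotary_angles(pos, dim=64)  # ready to build cos/sin rotation tables
```

Because the newest block always occupies the same place in the positional table, generation can continue indefinitely without ever exceeding the base model's positional limits.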

Why it matters?

This work matters because it improves both the quality and the controllability of AI-generated video. By lifting the length, responsiveness, and transition limits of earlier methods, ∞-RoPE opens the door to longer, more dynamic, and more cinematic videos generated with greater precision. Because no retraining is required, it can be applied directly to existing video generation models, so the improvements are immediately usable.

Abstract

Current autoregressive video diffusion models are constrained by three core bottlenecks: (i) the finite temporal horizon imposed by the base model's 3D Rotary Positional Embedding (3D-RoPE), (ii) slow prompt responsiveness in maintaining fine-grained action control during long-form rollouts, and (iii) the inability to realize discontinuous cinematic transitions within a single generation stream. We introduce ∞-RoPE, a unified inference-time framework that addresses all three limitations through three interconnected components: Block-Relativistic RoPE, KV Flush, and RoPE Cut. Block-Relativistic RoPE reformulates temporal encoding as a moving local reference frame, where each newly generated latent block is rotated relative to the base model's maximum frame horizon while earlier blocks are rotated backward to preserve relative temporal geometry. This relativistic formulation eliminates fixed temporal positions, enabling continuous video generation far beyond the base positional limits. To obtain fine-grained action control without re-encoding, KV Flush renews the KV cache by retaining only two latent frames, the global sink and the last generated latent frame, thereby ensuring immediate prompt responsiveness. Finally, RoPE Cut introduces controlled discontinuities in temporal RoPE coordinates, enabling multi-cut scene transitions within a single continuous rollout. Together, these components establish ∞-RoPE as a training-free foundation for infinite-horizon, controllable, and cinematic video diffusion. Comprehensive experiments show that ∞-RoPE consistently surpasses previous autoregressive models in overall VBench scores.
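
As a companion to the abstract, here is a minimal sketch of the other two components, assuming a per-frame KV cache stored as a list of (key, value) tensors. The tensor shapes, the size of the positional jump, and the function names are illustrative assumptions, not the paper's API.

```python
import torch

def kv_flush(cache: list[tuple[torch.Tensor, torch.Tensor]]) -> list[tuple[torch.Tensor, torch.Tensor]]:
    """KV Flush: keep only the global sink (first latent frame) and the most
    recently generated latent frame, so a freshly injected prompt immediately
    dominates attention instead of competing with a long history."""
    return cache if len(cache) <= 2 else [cache[0], cache[-1]]

def rope_cut(positions: torch.Tensor, cut_index: int, jump: int) -> torch.Tensor:
    """RoPE Cut (sketch): offset the temporal coordinates of all frames from
    `cut_index` onward. The deliberate discontinuity in temporal RoPE breaks
    frame-to-frame continuity, producing a hard scene cut inside one
    continuous rollout."""
    out = positions.clone()
    out[cut_index:] += jump
    return out

# Usage sketch: flush the cache when the action prompt changes, and inject a
# positional jump where a cinematic cut should occur.
cache = [(torch.randn(1, 8, 256, 64), torch.randn(1, 8, 256, 64)) for _ in range(12)]
cache = kv_flush(cache)                      # [sink frame, last frame]
pos = torch.arange(16)
pos = rope_cut(pos, cut_index=8, jump=100)   # frames 8.. get positions 108..115
```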