VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory

Yifei Yu, Xiaoshan Wu, Xinting Hu, Tao Hu, Yangtian Sun, Xiaoyang Lyu, Bo Wang, Lin Ma, Yuewen Ma, Zhongrui Wang, Xiaojuan Qi

2025-12-11

Summary

This paper introduces a new method, VideoSSM, for creating long, coherent videos using a technique called autoregressive diffusion. It focuses on improving the quality and consistency of videos generated over extended periods, like several minutes long.

What's the problem?

Generating long videos frame by frame is tricky because small errors in each frame can build up over time, leading to issues like shaky motion, the video drifting off-topic, or repeating the same content over and over. Existing methods struggle to maintain a consistent storyline and realistic movement throughout a lengthy video.

What's the solution?

The researchers tackled this by giving the video generation process a 'memory'. They combined autoregressive diffusion with something called a 'state-space model'. Think of the state-space model as a long-term memory that remembers the overall scene and keeps things consistent, while a smaller 'context window' acts as short-term memory, focusing on the details of the current moment and motion. This combination allows the video to stay on track without getting stuck in loops or losing its overall theme.
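The division of labor described above can be sketched as a toy recurrence: a linear state-space update acts as the evolving global memory, while a fixed-size sliding window holds recent frame features as local memory. This is a minimal illustration of the idea, not the paper's actual architecture; all names, dimensions, and the way the two memories are combined are assumptions for demonstration.

```python
from collections import deque
import numpy as np

class HybridMemory:
    """Toy hybrid memory: a linear state-space recurrence as global
    memory plus a fixed-size sliding window as local memory.
    Purely illustrative; not the paper's actual model."""

    def __init__(self, dim, window=4, decay=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.A = decay * np.eye(dim)                     # state transition (slow global decay)
        self.B = rng.standard_normal((dim, dim)) * 0.1   # input projection
        self.state = np.zeros(dim)                       # evolving global scene memory
        self.window = deque(maxlen=window)               # short-term context of recent frames

    def step(self, frame_feat):
        # SSM recurrence: the global state is updated causally, one frame at a time
        self.state = self.A @ self.state + self.B @ frame_feat
        self.window.append(frame_feat)
        # Conditioning signal: global state concatenated with the local context mean
        local = np.mean(self.window, axis=0)
        return np.concatenate([self.state, local])

mem = HybridMemory(dim=8)
for t in range(10):                 # stream 10 frame features causally
    ctx = mem.step(np.ones(8) * t)
print(ctx.shape)                    # global + local memory: (16,)
```

Because the state is a fixed-size vector updated once per frame, the per-frame cost is constant, which is what gives the linear scaling in sequence length that the abstract claims.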

Why it matters?

This work is important because it makes it possible to create much longer and more realistic videos automatically. It’s a step towards being able to generate entire movies or interactive experiences where the video responds to your commands, all while maintaining a consistent and believable visual experience. It also offers a more efficient way to generate these videos, scaling better with longer durations.

Abstract

Autoregressive (AR) diffusion enables streaming, interactive long-video generation by producing frames causally, yet maintaining coherence over minute-scale horizons remains challenging due to accumulated errors, motion drift, and content repetition. We approach this problem from a memory perspective, treating video synthesis as a recurrent dynamical process that requires coordinated short- and long-term context. We propose VideoSSM, a Long Video Model that unifies AR diffusion with a hybrid state-space memory. The state-space model (SSM) serves as an evolving global memory of scene dynamics across the entire sequence, while a context window provides local memory for motion cues and fine details. This hybrid design preserves global consistency without frozen, repetitive patterns, supports prompt-adaptive interaction, and scales in linear time with sequence length. Experiments on short- and long-range benchmarks demonstrate state-of-the-art temporal consistency and motion stability among autoregressive video generators, especially at minute-scale horizons, enabling content diversity and interactive prompt-based control, thereby establishing a scalable, memory-aware framework for long video generation.