TokensGen: Harnessing Condensed Tokens for Long Video Generation

Wenqi Ouyang, Zeqi Xiao, Danni Yang, Yifan Zhou, Shuai Yang, Lei Yang, Jianlou Si, Xingang Pan

2025-07-22

Summary

This paper introduces TokensGen, a two-stage system designed to generate long, smooth, and consistent videos using condensed tokens that represent short clips.

What's the problem?

The problem is that current models are good at creating short video clips, but when you try to make longer videos, they struggle with keeping details consistent, controlling the content properly, and making sure the clips flow smoothly together.

What's the solution?

The authors created a two-stage framework. In the first stage, a model generates short clips guided by condensed semantic tokens extracted from each video segment, which capture its content and motion. In the second stage, a transformer generates the condensed tokens for every clip of a longer video at once, so the whole video stays consistent from start to finish. At generation time, they also blend neighboring clips so the transitions between them look seamless.
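To make the two-stage idea concrete, here is a minimal PyTorch sketch. Every name in it (ClipTokenizer, LongVideoTokenModel), all dimensions, and the attention-pooling design are illustrative assumptions, not the paper's actual code; it only shows the shape of the pipeline: condense a clip into a few tokens, then let a transformer plan the tokens for all clips of a long video at once.

```python
import torch
import torch.nn as nn

class ClipTokenizer(nn.Module):
    """Stage 1 helper (hypothetical): condense a short clip's frame
    features into a small set of semantic tokens via attention pooling."""
    def __init__(self, frame_dim=512, num_tokens=16, token_dim=256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, token_dim))
        self.proj = nn.Linear(frame_dim, token_dim)
        self.attn = nn.MultiheadAttention(token_dim, num_heads=4,
                                          batch_first=True)

    def forward(self, clip_feats):                  # (B, T*HW, frame_dim)
        keys = self.proj(clip_feats)                # (B, T*HW, token_dim)
        q = self.queries.expand(clip_feats.size(0), -1, -1)
        tokens, _ = self.attn(q, keys, keys)        # (B, num_tokens, token_dim)
        return tokens

class LongVideoTokenModel(nn.Module):
    """Stage 2 sketch: a transformer that predicts the condensed tokens
    for every clip of the long video in one pass, conditioned on a text
    embedding, so all clips share one globally consistent plan."""
    def __init__(self, token_dim=256, num_clips=8, num_tokens=16):
        super().__init__()
        layer = nn.TransformerEncoderLayer(token_dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Learned slots: one group of token positions per future clip.
        self.slots = nn.Parameter(torch.randn(num_clips * num_tokens,
                                              token_dim))

    def forward(self, text_emb):                    # (B, L, token_dim)
        B, L = text_emb.size(0), text_emb.size(1)
        seq = torch.cat([text_emb, self.slots.expand(B, -1, -1)], dim=1)
        out = self.encoder(seq)
        return out[:, L:]                           # tokens for every clip

if __name__ == "__main__":
    tok, planner = ClipTokenizer(), LongVideoTokenModel()
    clip_tokens = tok(torch.randn(2, 1024, 512))    # condense one clip
    plan = planner(torch.randn(2, 10, 256))         # plan 8 clips at once
    print(clip_tokens.shape, plan.shape)            # (2,16,256) (2,128,256)
```

In the actual system, a short-clip video generator (not shown) would then render each clip conditioned on its slice of the planned tokens, with neighboring clips blended at inference for smooth transitions.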

Why it matters?

This matters because it helps AI generate long, high-quality videos that stay visually coherent and tell consistent stories, opening new possibilities for film, virtual reality, and other creative video applications.

Abstract

TokensGen uses a two-stage framework with condensed tokens to generate long, consistent videos by addressing semantic control, long-term consistency, and smooth transitions.