Key Features

Uses hierarchical latent tokens for long-range video consistency.
Compresses frames with a hierarchical autoencoder before generation.
Generates videos through coarse-to-fine rollout.
Targets longer consistent videos under a limited transformer token budget.
Examines the tradeoff between visual quality and temporal consistency in video generation.
Focuses on long-memory autoregressive video generation.
Provides arXiv and public code links.
Includes a direct project demo video hosted on the page.

The approach pretrains a hierarchical autoencoder that compresses each frame into multiple token levels, then generates video through a coarse-to-fine rollout. This lets the model preserve longer-term structure within a smaller token budget than a flat latent representation would require.
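
To make the token-budget argument concrete, here is a minimal arithmetic sketch in Python. It does not come from the MilliVid code. The level sizes, the budget, and the memory policy (only the most recent frames keep every level, older frames keep just the coarsest one) are illustrative assumptions, and the function names are hypothetical. The level-major ordering in `coarse_to_fine_order` is likewise only one possible reading of "coarse-to-fine rollout".

```python
from typing import List, Tuple


def flat_context_frames(budget: int, tokens_per_frame: int) -> int:
    """Frames that fit in the budget when every frame keeps all of its tokens."""
    return budget // tokens_per_frame


def hierarchical_context_frames(
    budget: int, level_sizes: List[int], recent_frames: int
) -> int:
    """Frames that fit when only recent frames keep every level.

    Assumed policy (not from the source): the newest `recent_frames` frames
    keep all levels, and older frames keep only the coarsest level,
    level_sizes[0].
    """
    full = sum(level_sizes)
    coarse = level_sizes[0]
    recent = min(recent_frames, budget // full)
    remaining = budget - recent * full
    return recent + remaining // coarse


def coarse_to_fine_order(num_frames: int, num_levels: int) -> List[Tuple[int, int]]:
    """(level, frame) pairs with every frame at level 0 first, then finer levels."""
    return [
        (level, frame)
        for level in range(num_levels)
        for frame in range(num_frames)
    ]


# Hypothetical numbers: three levels of 16, 64, and 256 tokens per frame.
LEVEL_SIZES = [16, 64, 256]
BUDGET = 4096

flat = flat_context_frames(BUDGET, sum(LEVEL_SIZES))
hier = hierarchical_context_frames(BUDGET, LEVEL_SIZES, recent_frames=4)
print(flat, hier)  # 12 176
print(coarse_to_fine_order(num_frames=2, num_levels=3))
```

With these made-up numbers, a flat layout fits 12 frames of context, while the hierarchical layout fits 176, most of them at coarse resolution only. The real ratio depends on the actual level sizes and memory policy, which the page does not specify.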


MilliVid is useful for video-generation researchers working on long clips, scene consistency, and memory-efficient generation. The project page links to the arXiv paper and the public code, and hosts a demo video.
