The approach pretrains a hierarchical autoencoder that compresses each frame into multiple token levels, then generates video through a coarse-to-fine rollout. This lets the model preserve longer-term structure under a tighter token budget than a flat latent representation.
MilliVid is useful for video-generation researchers working on long clips, scene consistency, and memory-efficient generation. The project page links to the arXiv paper and the code, and hosts a project video.
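
The token-budget argument can be made concrete with a toy sketch. It is not MilliVid's actual code: the per-level token counts, the context budget, and the names `coarse_to_fine_schedule` and `frames_in_context` are all hypothetical, and the rollout order shown (every frame's coarse tokens before any finer level) is only one plausible reading of "coarse-to-fine".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenSlot:
    level: int  # 0 is the coarsest level (assumed convention)
    frame: int  # frame index within the clip
    index: int  # token position within that frame at that level


def coarse_to_fine_schedule(num_frames, tokens_per_level):
    """List token slots level by level, coarsest first.

    Assumption: a level is generated for all frames before the next,
    finer level starts. The real rollout may interleave differently.
    """
    schedule = []
    for level, n_tokens in enumerate(tokens_per_level):
        for frame in range(num_frames):
            for index in range(n_tokens):
                schedule.append(TokenSlot(level, frame, index))
    return schedule


def frames_in_context(token_budget, tokens_per_frame):
    """Number of whole frames that fit in a fixed token budget."""
    return token_budget // tokens_per_frame


# Hypothetical numbers: 4 coarse + 60 fine tokens per frame,
# versus a flat code of 64 tokens per frame.
tokens_per_level = [4, 60]
flat_tokens_per_frame = sum(tokens_per_level)
budget = 1024

coarse_span = frames_in_context(budget, tokens_per_level[0])
flat_span = frames_in_context(budget, flat_tokens_per_frame)
schedule = coarse_to_fine_schedule(num_frames=3, tokens_per_level=tokens_per_level)

print(coarse_span, flat_span)  # 256 16
print(len(schedule))           # 192
```

With these made-up numbers, the coarse level alone spans 256 frames in the same budget that holds 16 frames of flat tokens, which is the sense in which a coarse pass can carry longer-range structure before finer levels fill in detail.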


