SemanticGen: Video Generation in Semantic Space
Jianhong Bai, Xiaoshi Wu, Xintao Wang, Fu Xiao, Yuanxing Zhang, Qinghe Wang, Xiaoyu Shi, Menghan Xia, Zuozhu Liu, Haoji Hu, Pengfei Wan, Kun Gai
2025-12-24
Summary
This paper introduces SemanticGen, a new method for generating videos that aims to improve on existing techniques, which tend to be slow to train and computationally expensive, especially when producing longer videos.
What's the problem?
Current video generation models learn the patterns in a compressed 'hidden code' of videos (called VAE latents) and then use a decoder to turn that code back into the pixels you see. While this can produce good results, these models converge slowly during training and become extremely demanding once a video is more than a few seconds long, because they model every low-level detail of every frame jointly, with attention spanning all of those tokens at once. Essentially, they try to figure out everything about the video in one shot, which is inefficient.
What's the solution?
SemanticGen tackles this by generating a video in two stages, starting with the 'big picture'. In the first stage, a diffusion model generates a compact, high-level semantic representation of the video – think of it as a storyboard that fixes the overall layout and what happens when. In the second stage, another diffusion model, conditioned on those semantic features, generates the detailed VAE latents, which a decoder then turns into the final video. By planning the video's structure first and only then filling in high-frequency detail, the method avoids modeling every low-level token from the start, which speeds up training and keeps long videos computationally manageable.
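To make the cascade concrete, here is a minimal structural sketch of a two-stage pipeline in this spirit. It is not the authors' code: the toy denoiser, the simplified sampler, and all shapes and sizes are illustrative assumptions; the paper only specifies that stage one diffuses compact semantic features and stage two diffuses VAE latents conditioned on them.

    # Minimal structural sketch of a two-stage (semantic -> latent) cascade.
    # NOT the paper's architecture: modules, sampler, and sizes are assumptions.
    import torch
    import torch.nn as nn

    class ToyDenoiser(nn.Module):
        """Stand-in for a diffusion transformer; predicts the clean sample from a noisy one."""
        def __init__(self, dim, cond_dim=0):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim + cond_dim + 1, 256),
                nn.SiLU(),
                nn.Linear(256, dim),
            )

        def forward(self, x, t, cond=None):
            # t is a scalar timestep in [0, 1], broadcast to every item in the batch.
            t_emb = t.expand(x.shape[0], 1)
            parts = [x, t_emb] if cond is None else [x, cond, t_emb]
            return self.net(torch.cat(parts, dim=-1))

    @torch.no_grad()
    def sample(denoiser, shape, cond=None, steps=20):
        # Deliberately simplified deterministic sampler: start from noise and move
        # a fraction of the way toward the model's current clean-sample estimate.
        x = torch.randn(shape)
        for i in range(steps, 0, -1):
            t = torch.tensor([[i / steps]])
            x0_hat = denoiser(x, t, cond)
            x = x + (x0_hat - x) / i
        return x

    # Hypothetical sizes: a handful of semantic tokens vs. many more VAE-latent tokens.
    B = 1
    SEM_DIM = 16 * 64      # e.g. 16 semantic tokens of width 64, flattened (assumed)
    LAT_DIM = 256 * 16     # e.g. 256 VAE-latent tokens of width 16, flattened (assumed)

    stage1 = ToyDenoiser(dim=SEM_DIM)                      # stage 1: semantic-space diffusion
    stage2 = ToyDenoiser(dim=LAT_DIM, cond_dim=SEM_DIM)    # stage 2: latent diffusion, conditioned

    semantic = sample(stage1, (B, SEM_DIM))                 # global layout ("storyboard")
    latents  = sample(stage2, (B, LAT_DIM), cond=semantic)  # details, conditioned on the plan
    # A pretrained VAE decoder (not shown) would map `latents` to pixels.
    print(semantic.shape, latents.shape)

The structural point is the conditioning arrow: the second sampler never runs unconditionally; it always receives the stage-one semantic features as input, so the low-level model only has to add detail to a plan that already exists.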
Why it matters?
This research is important because it offers a way to generate high-quality videos much more quickly and with less computing power than previous methods. This is especially crucial for creating longer videos, opening up possibilities for applications like movie making, content creation, and simulations where generating realistic and lengthy video sequences is essential.
Abstract
State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. While this approach can generate high-quality videos, it suffers from slow convergence and is computationally expensive when generating long videos. In this paper, we introduce SemanticGen, a novel solution to address these limitations by generating videos in the semantic space. Our main insight is that, due to the inherent redundancy in videos, the generation process should begin in a compact, high-level semantic space for global planning, followed by the addition of high-frequency details, rather than directly modeling a vast set of low-level video tokens using bi-directional attention. SemanticGen adopts a two-stage generation process. In the first stage, a diffusion model generates compact semantic video features, which define the global layout of the video. In the second stage, another diffusion model generates VAE latents conditioned on these semantic features to produce the final output. We observe that generation in the semantic space leads to faster convergence compared to the VAE latent space. Our method is also effective and computationally efficient when extended to long video generation. Extensive experiments demonstrate that SemanticGen produces high-quality videos and outperforms state-of-the-art approaches and strong baselines.
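One way to see why a compact semantic space helps for long videos: bi-directional self-attention cost grows roughly quadratically with the number of tokens, so globally planning over a small set of semantic tokens is far cheaper than attending over every low-level VAE-latent token. The numbers below are made up purely to illustrate that scaling; they are not measurements from the paper.

    # Illustrative scaling argument only; all token counts here are hypothetical,
    # not numbers from the paper.
    def attention_flops(num_tokens: int, dim: int = 1024) -> float:
        """Rough FLOPs for one bi-directional self-attention layer: O(n^2 * d)."""
        return 2.0 * num_tokens ** 2 * dim

    vae_tokens_per_second = 10_000       # hypothetical: dense low-level VAE-latent tokens
    semantic_tokens_per_second = 500     # hypothetical: compact semantic tokens

    for seconds in (5, 30, 120):
        dense = attention_flops(vae_tokens_per_second * seconds)
        compact = attention_flops(semantic_tokens_per_second * seconds)
        print(f"{seconds:4d}s clip: {dense / 1e12:10.1f} vs {compact / 1e12:8.3f} TFLOPs "
              f"per layer (~{dense / compact:.0f}x cheaper in semantic space)")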