Goku: Flow Based Video Generative Foundation Models
Shoufa Chen, Chongjian Ge, Yuqi Zhang, Yida Zhang, Fengda Zhu, Hao Yang, Hongxiang Hao, Hui Wu, Zhichao Lai, Yifei Hu, Ting-Che Lin, Shilong Zhang, Fu Li, Chuan Li, Xing Wang, Yanghua Peng, Peize Sun, Ping Luo, Yi Jiang, Zehuan Yuan, Bingyue Peng, Xiaobing Liu
2025-02-10
Summary
This paper introduces Goku, a new family of video generation models that uses a technique called rectified flow Transformers to produce highly realistic videos from text or images. It works better than older methods by combining image and video generation into one system.
What's the problem?
Existing AI tools struggle to make videos that look smooth and natural. They often create choppy movements, weird faces, or scenes that don't match the text description properly. It's especially hard to keep things consistent when making longer videos or switching between different types of content.
What's the solution?
The team built Goku on rectified flow, a formulation that teaches the model to move from random noise to a finished image or video along nearly straight paths, which makes generation more stable and efficient. They trained it on massive amounts of carefully curated, high-quality images and videos, and designed the system to handle both images and videos together instead of separately. They also built better ways to filter training data and optimized the infrastructure for large-scale training.
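To make the rectified flow idea concrete, here is a minimal one-dimensional sketch. This is an illustration, not the Goku implementation: instead of training a Transformer, it uses a toy setting (noise x0 ~ N(0, 1), "data" x1 ~ N(MU, SIGMA²), with MU and SIGMA chosen here for demonstration) where the velocity field along the straight path x_t = (1 - t)·x0 + t·x1 has a closed form, so we can generate samples just by integrating the flow.

```python
import numpy as np

# Toy rectified-flow demo (not the paper's model): x0 ~ N(0, 1) is noise,
# x1 ~ N(MU, SIGMA^2) is a stand-in for the data distribution. Rectified flow
# defines the straight interpolation x_t = (1 - t) x0 + t x1 and learns the
# velocity field v(x, t) = E[x1 - x0 | x_t = x]; for two Gaussians this
# conditional expectation is available in closed form, so no training is needed.

MU, SIGMA = 3.0, 0.5  # assumed toy parameters

def velocity(x, t):
    var_t = (1 - t) ** 2 + (t * SIGMA) ** 2  # Var(x_t)
    cov = t * SIGMA ** 2 - (1 - t)           # Cov(x1 - x0, x_t)
    return MU + cov / var_t * (x - t * MU)

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)  # start from pure noise at t = 0
steps = 200
for i in range(steps):           # Euler integration of dx/dt = v(x, t), t: 0 -> 1
    x = x + velocity(x, i / steps) / steps

# After integration the samples should match the target distribution:
print(round(float(x.mean()), 2), round(float(x.std()), 2))
```

Because the paths are close to straight lines, a simple Euler integrator with few steps already lands near the target; that efficiency at sampling time is one of the practical appeals of the rectified flow formulation the paper adopts.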
Why does it matter?
This matters because it helps creators make professional-looking videos faster and cheaper - imagine turning a product photo into a TV commercial instantly. Better AI video tools could transform movies, advertising, education, and social media while making content creation accessible to more people.
Abstract
This paper introduces Goku, a state-of-the-art family of joint image-and-video generation models leveraging rectified flow Transformers to achieve industry-leading performance. We detail the foundational elements enabling high-quality visual generation, including the data curation pipeline, model architecture design, flow formulation, and advanced infrastructure for efficient and robust large-scale training. The Goku models demonstrate superior performance in both qualitative and quantitative evaluations, setting new benchmarks across major tasks. Specifically, Goku achieves 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, and 84.85 on VBench for text-to-video tasks. We believe that this work provides valuable insights and practical advancements for the research community in developing joint image-and-video generation models.