Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Viacheslav Vasilev, Alexey Letunovskiy, Maria Kovaleva, Nikolai Vaulin, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Nikita Kiselev, Alexander Varlamov, Dmitrii Mikhailov, Vladimir Polovnikov, Andrey Shutkin, Ilya Vasiliev, Julia Agafonova, Anastasiia Kargapoltseva, Anna Dmitrienko, Anastasia Maltseva, Anna Averchenkova
2025-11-20
Summary
This paper introduces Kandinsky 5.0, a new family of AI models designed to create highly detailed images and short videos (up to 10 seconds) from text descriptions.
What's the problem?
Creating high-quality images and videos from text is really hard for computers. Existing models often struggle with detail, speed, or both. It's also difficult to make these models widely available for others to build upon and improve.
What's the solution?
The researchers developed three versions of Kandinsky 5.0, each with different strengths: Image Lite, a 6-billion-parameter model for image generation; Video Lite, a fast and lightweight 2-billion-parameter video model; and Video Pro, a much larger 19-billion-parameter model for the best video quality. They also focused on carefully preparing the data used to train the AI and applied advanced training techniques to improve the results. Finally, they optimized the models to run efficiently.
Why it matters?
Kandinsky 5.0 is important because it pushes the boundaries of what's possible with AI image and video generation. By releasing the code and model checkpoints publicly, the researchers hope to help other scientists and developers create even better generative AI tools, making this technology more accessible to everyone.
Abstract
This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core model line-ups: Kandinsky 5.0 Image Lite - a line-up of 6B parameter image generation models; Kandinsky 5.0 Video Lite - fast and lightweight 2B parameter text-to-video and image-to-video models; and Kandinsky 5.0 Video Pro - 19B parameter models that achieve superior video generation quality. We provide a comprehensive review of the data curation lifecycle - including collection, processing, filtering, and clustering - for the multi-stage training pipeline, which involves extensive pre-training and incorporates quality-enhancement techniques such as self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations that enable Kandinsky 5.0 to achieve high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. As a large-scale, publicly available generative framework, Kandinsky 5.0 leverages the full potential of its pre-training and subsequent stages, allowing it to be adapted to a wide range of generative applications. We hope that this report, together with the release of our open-source code and training checkpoints, will substantially advance the development and accessibility of high-quality generative models for the research community.
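To make the training recipe described above concrete, the sketch below encodes the data curation steps and the three training stages named in the abstract as a simple Python configuration. It is purely illustrative: every class, function, and description string here is a hypothetical placeholder and is not taken from the Kandinsky 5.0 codebase.

# Illustrative sketch only: the step and stage names come from the abstract;
# every identifier and description below is hypothetical, not Kandinsky 5.0 code.
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataCurationConfig:
    # Data curation lifecycle as enumerated in the report.
    steps: List[str] = field(
        default_factory=lambda: ["collection", "processing", "filtering", "clustering"]
    )


@dataclass
class TrainingStage:
    name: str         # e.g. "pre-training", "SFT", "RL post-training"
    description: str  # informal summary of the stage's purpose (assumed wording)


def build_training_pipeline() -> List[TrainingStage]:
    # Multi-stage pipeline described in the abstract: extensive pre-training
    # followed by quality-enhancement post-training stages.
    return [
        TrainingStage("pre-training", "large-scale training on the curated image/video data"),
        TrainingStage("SFT", "fine-tuning on high-quality data to improve output quality"),
        TrainingStage("RL post-training", "reinforcement-learning-based quality enhancement"),
    ]


if __name__ == "__main__":
    curation = DataCurationConfig()
    print("Data curation:", " -> ".join(curation.steps))
    for stage in build_training_pipeline():
        print(f"{stage.name:>16}: {stage.description}")

Running this only prints the pipeline layout; the actual training code and checkpoints are provided in the authors' open-source release.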