
Seedance 2.0: Advancing Video Generation for World Complexity

Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, Mojie Chi, Xuyan Chi, Jian Cong, Qinpeng Cui, Fei Ding, Qide Dong, Yujiao Du, Haojie Duanmu, Junliang Fan, Jiarui Fang, Jing Fang, Zetao Fang

2026-04-16


Summary

This paper introduces Seedance 2.0, a new artificial intelligence model that can create both audio and video from different kinds of inputs like text, images, audio, and other videos.

What's the problem?

Existing AI models for creating audio and video often struggle to handle multiple types of inputs at once, or they fail to produce high-quality results consistently. Previous versions of Seedance, while capable, left room for improvement in efficiency, the range of inputs they could accept, and the overall quality of the generated content.

What's the solution?

The creators of Seedance 2.0 built a completely new system designed to work with text, images, audio, and video inputs simultaneously. It is more efficient than its predecessors and can generate audio-video content from 4 to 15 seconds long at native resolutions of 480p or 720p. They also created a faster variant, Seedance 2.0 Fast, for situations where quick results are needed, and the model can reference up to 3 video clips, 9 images, and 3 audio clips at the same time.

Why it matters?

Seedance 2.0 is important because it represents a significant step forward in AI’s ability to create realistic and high-quality audio and video content. This could have a big impact on fields like filmmaking, content creation, and even everyday communication, making it easier for people to express their ideas and bring them to life.

Abstract

Seedance 2.0 is a new native multi-modal audio-video generation model, officially released in China in early February 2026. Compared with its predecessors, Seedance 1.0 and 1.5 Pro, Seedance 2.0 adopts a unified, highly efficient, and large-scale architecture for multi-modal audio-video joint generation. It supports four input modalities (text, image, audio, and video) and integrates one of the most comprehensive suites of multi-modal content reference and editing capabilities available in the industry to date. It delivers substantial, well-rounded improvements across all key sub-dimensions of video and audio generation. In both expert evaluations and public user tests, the model has demonstrated performance on par with the leading levels in the field. Seedance 2.0 supports direct generation of audio-video content with durations ranging from 4 to 15 seconds, with native output resolutions of 480p and 720p. For multi-modal reference inputs, its current open platform supports up to 3 video clips, 9 images, and 3 audio clips. In addition, we provide Seedance 2.0 Fast, an accelerated variant of Seedance 2.0 designed to boost generation speed in low-latency scenarios. Seedance 2.0 delivers significant improvements to its foundational generation capabilities and multi-modal generation performance, bringing an enhanced creative experience to end users.