Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model
Heyi Chen, Siyan Chen, Xin Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Tianheng Cheng, Xinqi Cheng, Xuyan Chi, Jian Cong, Jing Cui, Qinpeng Cui, Qide Dong, Junliang Fan, Jing Fang, Zetao Fang, Chengjian Feng, Han Feng, Mingyuan Gao, Yu Gao, Dong Guo
2025-12-19
Summary
This paper introduces Seedance 1.5 pro, a new artificial intelligence model designed to create video and audio together in a single process, rather than as separate steps. It is a significant step forward in generating realistic, synchronized audio-visual content.
What's the problem?
Getting a video's lip movements to match the spoken words, and making the audio and video feel naturally connected, is very hard for computers. Existing methods often struggle with synchronization, realistic movement, and coherent storytelling, and generating high-quality audio and video *together* is harder than producing them separately.
What's the solution?
The researchers built Seedance 1.5 pro on a 'Dual-Branch Diffusion Transformer' architecture: one branch handles video and the other handles audio, and a cross-modal joint module lets the two exchange information so the sound and picture stay in step. They also designed a careful multi-stage pipeline to prepare the training data, then refined the model with supervised fine-tuning on high-quality examples and with feedback from human raters. Finally, they developed an acceleration framework that makes generation more than ten times faster.
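To make the dual-branch idea concrete, here is a minimal conceptual sketch (not the authors' implementation; all dimensions, token counts, and function names are illustrative): each branch first refines its own modality with self-attention, then a cross-modal joint step lets each branch attend to the other's tokens, which is what lets sound and picture condition on each other.

```python
# Conceptual sketch of a dual-branch block with a cross-modal joint step.
# NOT the Seedance 1.5 pro code; sizes and names are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
d = 16  # shared embedding width (illustrative)

def attend(queries, keys, values):
    """Plain scaled dot-product attention with a softmax over keys."""
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

def dual_branch_block(video_tokens, audio_tokens):
    # 1) Each branch refines its own modality (self-attention + residual).
    video = video_tokens + attend(video_tokens, video_tokens, video_tokens)
    audio = audio_tokens + attend(audio_tokens, audio_tokens, audio_tokens)
    # 2) Cross-modal joint step: each branch queries the other, so e.g.
    #    lip motion can condition on the speech tokens and vice versa.
    video = video + attend(video, audio, audio)
    audio = audio + attend(audio, video, video)
    return video, audio

video = rng.standard_normal((8, d))   # 8 video patch tokens
audio = rng.standard_normal((4, d))   # 4 audio frame tokens
video_out, audio_out = dual_branch_block(video, audio)
print(video_out.shape, audio_out.shape)  # (8, 16) (4, 16)
```

Each modality keeps its own token stream and parameters (the "dual branch"), while the joint attention step is the only place information crosses between them; in the real model this block would be stacked many times inside a diffusion transformer.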
Why it matters?
Seedance 1.5 pro is important because it can create professional-quality videos with realistic lip-syncing in multiple languages, dynamic camera movements, and stories that make sense. This opens up possibilities for easier and more efficient content creation for things like movies, games, and online videos, potentially making it accessible to more people.
Abstract
Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo.