Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model

SII-GAIR, Sand.ai, Ethan Chern, Hansi Teng, Hanwen Sun, Hao Wang, Hong Pan, Hongyu Jia, Jiadi Su, Jin Li, Junjie Yu, Lijie Liu, Lingzhi Li, Lyumanshan Ye, Min Hu, Qiangang Wang, Quanwei Qi, Steffi Chern, Tao Bu, Taoran Wang, Teren Xu, Tianning Zhang

2026-03-24

Summary

This paper introduces daVinci-MagiHuman, a new open-source model that creates realistic videos with synchronized speech from just text instructions. It's designed specifically for human-centric content, like making a character talk and move naturally.

What's the problem?

Creating videos where the audio and visuals match each other perfectly is hard. Existing methods often rely on complicated setups with separate components for video and audio, which makes them difficult to build and improve. Many models also struggle to produce natural-looking human movement and speech, or support only a single language.

What's the solution?

The researchers built daVinci-MagiHuman around a single 'brain', a type of neural network called a Transformer, that processes text, video, and audio together in one token sequence. This single-stream design simplifies the system and makes it easier to train. They also used techniques like model distillation, latent-space super-resolution, and a fast VAE decoder to speed up generation, allowing the model to create a 5-second video in about 2 seconds on a single high-end GPU (an H100). The model can generate speech in several languages, including English, Chinese (Mandarin and Cantonese), Japanese, Korean, German, and French.

Why it matters?

This work matters because it provides a freely available tool for creating high-quality, human-centric videos from text. It's a step towards making video creation more accessible, with potential uses in educational content, personalized videos, and filmmaking. Its strong performance against other leading open-source models and its multilingual support make it a notable advance in the field.

Abstract

We present daVinci-MagiHuman, an open-source audio-video generative foundation model for human-centric generation. daVinci-MagiHuman jointly generates synchronized video and audio using a single-stream Transformer that processes text, video, and audio within a unified token sequence via self-attention only. This single-stream design avoids the complexity of multi-stream or cross-attention architectures while remaining easy to optimize with standard training and inference infrastructure. The model is particularly strong in human-centric scenarios, producing expressive facial performance, natural speech-expression coordination, realistic body motion, and precise audio-video synchronization. It supports multilingual spoken generation across Chinese (Mandarin and Cantonese), English, Japanese, Korean, German, and French. For efficient inference, we combine the single-stream backbone with model distillation, latent-space super-resolution, and a Turbo VAE decoder, enabling generation of a 5-second 256p video in 2 seconds on a single H100 GPU. In automatic evaluation, daVinci-MagiHuman achieves the highest visual quality and text alignment among leading open models, along with the lowest word error rate (14.60%) for speech intelligibility. In pairwise human evaluation, it achieves win rates of 80.0% against Ovi 1.1 and 60.9% against LTX 2.3 over 2000 comparisons. We open-source the complete model stack, including the base model, the distilled model, the super-resolution model, and the inference codebase.
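The efficient inference stack described in the abstract (distilled few-step generation, then latent-space super-resolution, then a fast "Turbo" VAE decode) can be sketched as a three-stage pipeline. Everything here is a hypothetical stand-in with toy shapes; none of these function names come from the released codebase.

```python
import random

def distilled_denoise(prompt, steps):
    # Stand-in for a few-step distilled sampler: the real model would
    # denoise a joint video-audio latent in only `steps` passes.
    random.seed(hash(prompt) % (2 ** 32))
    latent = [[random.random() for _ in range(8)] for _ in range(4)]  # 4 frames, dim 8
    for _ in range(steps):
        latent = [[0.5 * x for x in frame] for frame in latent]  # placeholder update
    return latent

def latent_super_resolution(latent, factor=2):
    # Stand-in for latent-space upscaling: the expensive backbone runs at
    # low resolution, and resolution is recovered cheaply in latent space.
    return [frame * factor for frame in latent]

def turbo_vae_decode(latent):
    # Stand-in for the fast VAE decoder mapping latents to pixels + waveform.
    video = [[x * 255 for x in frame] for frame in latent]
    audio = [sum(frame) / len(frame) for frame in latent]
    return video, audio

def generate(prompt, steps=4):
    # Full hypothetical pipeline: distill -> latent SR -> fast decode.
    latent = distilled_denoise(prompt, steps)
    hires = latent_super_resolution(latent)
    return turbo_vae_decode(hires)
```

The design point is that each stage trades quality-per-step for speed somewhere the backbone is cheapest: fewer sampling steps via distillation, upscaling in latent space rather than pixel space, and a lightweight decoder at the very end.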