Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length
Yubo Huang, Hailong Guo, Fangtai Wu, Shifeng Zhang, Shijie Huang, Qijun Gan, Lin Liu, Sirui Zhao, Enhong Chen, Jiaming Liu, Steven Hoi
2025-12-05
Summary
This paper introduces Live Avatar, a new system for creating realistic, moving avatars from audio in real time using a powerful type of AI called a diffusion model.
What's the problem?
Current AI methods for generating videos, especially audio-driven ones such as talking avatars, are slow and often inconsistent over time. Because they compute each step one after another, they demand a lot of computing power and can let the avatar's appearance drift or glitch during longer videos. This makes them impractical for live streaming and other real-time applications.
What's the solution?
The researchers tackled this problem with a combination of techniques. First, they sped up generation by pipelining the denoising steps of the diffusion model across multiple graphics cards (GPUs): each GPU handles a different step, so a new frame can enter the pipeline while earlier frames are still being denoised downstream, rather than waiting for each frame to finish completely. This is called Timestep-forcing Pipeline Parallelism. Second, they used a 'rolling sink frame' mechanism, in which the avatar's appearance is continually recalibrated against a cached reference image to maintain consistency and prevent identity drift and color artifacts. Finally, they distilled the AI model itself (via Self-Forcing Distribution Matching Distillation) so it can generate a continuous, causal stream of video without losing quality.
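The pipelining idea above can be illustrated with a small, self-contained sketch. This is not the paper's implementation: plain threads stand in for GPUs, and a trivial arithmetic update stands in for one denoising timestep; all names here are hypothetical. The point it demonstrates is structural: with one stage per timestep, frame k+1 enters stage 0 while frame k is still being processed by later stages, so throughput is no longer bound by running all timesteps sequentially per frame.

```python
import queue
import threading

# Illustrative sketch (assumption, not the paper's code): pipeline the
# denoising timesteps of a streaming model across stages, one stage per
# timestep, the way Timestep-forcing Pipeline Parallelism assigns each
# denoising step to its own GPU.

NUM_STEPS = 4    # denoising timesteps = pipeline stages
NUM_FRAMES = 6   # length of the streamed clip

def denoise_step(latent, step):
    """Placeholder for one denoising timestep on one latent frame."""
    return latent + 10 ** step  # cheap, deterministic stand-in update

def stage_worker(step, inbox, outbox):
    """One pipeline stage: pull a latent, apply its timestep, pass it on."""
    while True:
        item = inbox.get()
        if item is None:        # shutdown sentinel
            outbox.put(None)
            break
        frame_id, latent = item
        outbox.put((frame_id, denoise_step(latent, step)))

# Wire stages together with queues and start one thread per timestep.
queues = [queue.Queue() for _ in range(NUM_STEPS + 1)]
threads = [
    threading.Thread(target=stage_worker, args=(s, queues[s], queues[s + 1]))
    for s in range(NUM_STEPS)
]
for t in threads:
    t.start()

# Stream noisy latents in: a frame enters stage 0 without waiting for
# the previous frame to clear all stages.
for frame_id in range(NUM_FRAMES):
    queues[0].put((frame_id, 0))
queues[0].put(None)

results = {}
while True:
    item = queues[-1].get()
    if item is None:
        break
    frame_id, latent = item
    results[frame_id] = latent

for t in threads:
    t.join()

print(results)  # each frame passed through all 4 stages: 1+10+100+1000 = 1111
```

In a real deployment the queues would be inter-GPU transfers of latent tensors, and each stage would hold the model weights needed for its timestep, but the flow of frames through fixed timestep stages is the same.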
Why it matters?
This work is important because it demonstrates a way to use advanced AI for real-time video generation, specifically for creating lifelike avatars. Achieving 20 frames per second with this level of detail opens up possibilities for applications like virtual meetings, gaming, and personalized digital assistants where a realistic and responsive avatar is needed. It shows a new path for deploying these complex AI models in practical, long-running video applications.
Abstract
Existing diffusion-based video generation methods are fundamentally constrained by sequential computation and long-horizon inconsistency, limiting their practical adoption in real-time, streaming audio-driven avatar synthesis. We present Live Avatar, an algorithm-system co-designed framework that enables efficient, high-fidelity, and infinite-length avatar generation using a 14-billion-parameter diffusion model. Our approach introduces Timestep-forcing Pipeline Parallelism (TPP), a distributed inference paradigm that pipelines denoising steps across multiple GPUs, effectively breaking the autoregressive bottleneck and ensuring stable, low-latency real-time streaming. To further enhance temporal consistency and mitigate identity drift and color artifacts, we propose the Rolling Sink Frame Mechanism (RSFM), which maintains sequence fidelity by dynamically recalibrating appearance using a cached reference image. Additionally, we leverage Self-Forcing Distribution Matching Distillation to facilitate causal, streamable adaptation of large-scale models without sacrificing visual quality. Live Avatar demonstrates state-of-the-art performance, reaching 20 FPS end-to-end generation on 5 H800 GPUs, and, to the best of our knowledge, is the first to achieve practical, real-time, high-fidelity avatar generation at this scale. Our work establishes a new paradigm for deploying advanced diffusion models in industrial long-form video synthesis applications.
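To make the appearance-recalibration idea behind the Rolling Sink Frame Mechanism concrete, here is a minimal sketch of one plausible form of it. This is an assumption, not the paper's method: it simply matches each new frame's per-channel color statistics to a cached reference frame, which is one simple way to damp the slow color drift the abstract describes; the function names and the `strength` parameter are invented for illustration.

```python
import numpy as np

# Illustrative sketch (assumption, not the paper's method): recalibrate a
# generated frame's appearance against a cached "sink" reference frame by
# matching per-channel mean/std, damping slow color drift over a long stream.

def recalibrate(frame, reference, strength=0.5):
    """Nudge an HxWx3 float frame's color statistics toward the reference."""
    out = frame.copy()
    for c in range(3):
        f_mean, f_std = frame[..., c].mean(), frame[..., c].std() + 1e-6
        r_mean, r_std = reference[..., c].mean(), reference[..., c].std() + 1e-6
        # Standardize the channel, then re-express it in reference statistics.
        matched = (frame[..., c] - f_mean) / f_std * r_std + r_mean
        # Blend between the raw and recalibrated channel.
        out[..., c] = (1 - strength) * frame[..., c] + strength * matched
    return out

rng = np.random.default_rng(0)
reference = rng.uniform(0.3, 0.7, size=(8, 8, 3))  # cached sink frame
drifted = reference + 0.2                          # simulated uniform color drift
fixed = recalibrate(drifted, reference, strength=1.0)
print(abs(fixed.mean() - reference.mean()) < 1e-6)  # drift removed
```

A rolling variant would periodically refresh the cached reference from recent clean output, so the anchor itself tracks legitimate appearance changes while still suppressing cumulative drift.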