LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation
Ethan Chern, Zhulin Hu, Bohao Tang, Jiadi Su, Steffi Chern, Zhijie Deng, Pengfei Liu
2025-12-30
Summary
This paper focuses on making video generation with AI happen in real-time, allowing for smooth and natural interactions between people and the AI. It's about creating videos from different types of inputs like text, images, and audio, and doing it quickly enough for a conversation.
What's the problem?
Currently, creating videos with AI diffusion models is slow. These models work by gradually refining a video while looking at all frames at once, which takes a lot of processing power and time. While some methods speed things up, they often break down when combining multiple types of input (like text *and* audio), producing videos with glitches such as flickering or black frames. Existing techniques also pay little attention to the quality of the conditioning inputs, the text, image, or audio that guides the video, leading to less realistic results.
What's the solution?
The researchers improved a technique called 'distillation' to make video generation faster without sacrificing quality. They focused on the quality of the text, image, and audio conditioning inputs the AI receives, and carefully tuned how the distilled model is initialized and how its step-by-step training is scheduled. The result is a model that generates video with roughly 20 times lower inference cost and latency than comparable full-step models while maintaining similar visual quality. They then built LiveTalk, a system that uses this model to drive real-time, interactive avatars.
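To see where the speedup comes from, here is a minimal toy sketch (not the paper's implementation; all function names, step counts, and the resulting speedup are illustrative assumptions). It contrasts a full-step bidirectional sampler, where every denoising step attends to all frames, with a distilled few-step autoregressive student that emits frames chunk by chunk:

```python
# Toy cost model only: real diffusion samplers run neural networks;
# here "cost" just counts per-frame denoising evaluations.

def full_step_bidirectional_sample(num_frames, num_steps=50):
    """Teacher-style sampling: every step jointly denoises all frames."""
    cost = 0
    for _ in range(num_steps):   # iterative refinement loop
        cost += num_frames       # bidirectional attention touches every frame
    return cost

def few_step_autoregressive_sample(num_frames, num_steps=4, chunk=1):
    """Distilled student: frames emitted chunk-by-chunk with few steps each,
    so the first frames are ready almost immediately (low latency)."""
    cost = 0
    for _ in range(0, num_frames, chunk):
        cost += num_steps * chunk
    return cost

teacher_cost = full_step_bidirectional_sample(num_frames=80)  # 80 * 50 = 4000
student_cost = few_step_autoregressive_sample(num_frames=80)  # 80 * 4  = 320
speedup = teacher_cost / student_cost                         # 12.5x in this toy setting
```

The step counts (50 vs. 4) are made up for illustration; the paper reports about a 20x reduction in inference cost against its full-step bidirectional baselines. The key structural point survives the simplification: fewer denoising steps cut total compute, and autoregressive chunked generation cuts the latency to the first frame.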
Why it matters?
This work is important because it enables truly interactive AI systems. Imagine conversing with an AI avatar that responds with realistic video in real time, based on what you say and show it. This technology could transform video conferencing, virtual assistants, and entertainment, making interactions with AI much more natural and engaging. Their system, LiveTalk, even outperforms current leading AI video generators like Sora2 and Veo3 in how well the video stays coherent over a longer, multi-turn conversation.
Abstract
Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. However, the simultaneous denoising of all video frames with bidirectional attention via an iterative process in diffusion models prevents real-time interaction. While existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate this, they focus primarily on text-to-video generation, leaving human-AI interaction unnatural and inefficient. This paper targets real-time interactive video diffusion conditioned on a multimodal context, including text, image, and audio, to bridge the gap. Given the observation that the leading on-policy distillation approach, Self Forcing, encounters challenges (visual artifacts like flickering, black frames, and quality degradation) with multimodal conditioning, we investigate an improved distillation recipe with emphasis on the quality of condition inputs as well as the initialization and schedule for the on-policy optimization. On benchmarks for multimodal-conditioned (audio, image, and text) avatar video generation including HDTF, AVSpeech, and CelebV-HQ, our distilled model matches the visual quality of full-step, bidirectional baselines of similar or larger size at 20x lower inference cost and latency. Further, we integrate our model with audio language models and the long-form video inference technique Anchor-Heavy Identity Sinks to build LiveTalk, a real-time multimodal interactive avatar system. System-level evaluation on our curated multi-turn interaction benchmark shows LiveTalk outperforms state-of-the-art models (Sora2, Veo3) in multi-turn video coherence and content quality, while reducing response latency from 1-2 minutes to real time, enabling seamless human-AI multimodal interaction.