PersonaLive: Expressive Portrait Image Animation for Live Streaming
Zhiyuan Li, Chi-Man Pun, Chen Fang, Jue Wang, Xiaodong Cun
2025-12-15
Summary
This paper introduces PersonaLive, a diffusion-based system for generating expressive, realistic portrait animations in real time, designed specifically for applications like live streaming.
What's the problem?
Existing diffusion-based portrait animation methods produce high-quality, expressive results, but they are too slow for live settings like streaming because each frame takes too long to generate. This limits where these animations can actually be used.
What's the solution?
The researchers tackled this problem in three key ways. First, they combined two kinds of implicit facial motion signals to give precise, expressive control over the animation. Second, they distilled the denoising process down to just a few steps, removing redundant appearance computation and making generation much faster. Finally, they generate the video in small chunks, almost like streaming, and condition each chunk on previously generated keyframes to keep the animation consistent and smooth while keeping latency low (a simplified sketch of this chunked generation is shown below).
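The chunked, history-conditioned generation loop can be illustrated with a minimal Python sketch. This is not the authors' code: the chunk size, history size, and the denoise_chunk stand-in are hypothetical placeholders invented for illustration; the real system runs a few-step diffusion denoiser conditioned on the reference portrait and cached keyframes.

```python
# Illustrative sketch (not the released implementation) of autoregressive
# micro-chunk streaming with a small cache of historical keyframes.
from collections import deque

import numpy as np

CHUNK_SIZE = 4    # frames generated per micro-chunk (assumed value)
HISTORY_SIZE = 2  # historical keyframes kept as conditioning (assumed value)


def denoise_chunk(motion_chunk, reference, history):
    """Placeholder for the few-step denoiser: here it just blends the
    reference appearance (and cached keyframes) with the driving motion."""
    context = np.mean([reference, *history], axis=0) if history else reference
    return 0.5 * context + 0.5 * motion_chunk


def stream_animation(reference_frame, motion_stream):
    """Yield frames chunk-by-chunk instead of rendering the whole clip,
    conditioning each chunk on previously generated keyframes."""
    history = deque(maxlen=HISTORY_SIZE)
    chunk = []
    for motion_frame in motion_stream:
        chunk.append(motion_frame)
        if len(chunk) == CHUNK_SIZE:
            frames = denoise_chunk(np.stack(chunk), reference_frame, list(history))
            history.append(frames[-1])  # cache the newest keyframe
            yield from frames           # emit frames with low latency
            chunk = []


if __name__ == "__main__":
    ref = np.zeros((64, 64, 3))                               # toy reference portrait
    motions = (np.random.rand(64, 64, 3) for _ in range(12))  # toy driving signal
    for i, frame in enumerate(stream_animation(ref, motions)):
        print(f"frame {i}: mean intensity {frame.mean():.3f}")
```

The point of the sketch is the control flow: emitting frames per micro-chunk rather than per clip is what bounds latency, and carrying a short keyframe history forward is what keeps the appearance stable over long videos.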
Why it matters?
PersonaLive is important because it makes diffusion-based portrait animation up to 7 to 22 times faster than previous methods while still maintaining high quality. This speedup makes real-time portrait animation practical for live streaming and other applications where latency is crucial, opening up new possibilities for interactive and personalized video experiences.
Abstract
Current diffusion-based portrait animation models predominantly focus on enhancing visual quality and expression realism while overlooking generation latency and real-time performance, which restricts their applicability in live streaming scenarios. We propose PersonaLive, a novel diffusion-based framework for streaming real-time portrait animation with multi-stage training recipes. Specifically, we first adopt hybrid implicit signals, namely implicit facial representations and 3D implicit keypoints, to achieve expressive image-level motion control. Then, a few-step appearance distillation strategy is proposed to eliminate appearance redundancy in the denoising process, greatly improving inference efficiency. Finally, we introduce an autoregressive micro-chunk streaming generation paradigm equipped with a sliding training strategy and a historical keyframe mechanism to enable low-latency and stable long-term video generation. Extensive experiments demonstrate that PersonaLive achieves state-of-the-art performance with up to a 7-22x speedup over prior diffusion-based portrait animation models.
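The other two ingredients of the abstract, hybrid implicit motion conditioning and few-step denoising, can be sketched in the same spirit. Everything below is a toy stand-in under loose assumptions: encode_face_implicit, encode_3d_keypoints, and few_step_denoise are invented for illustration and are not the paper's modules; they only mimic how two implicit signals are fused into one image-level control input and how a distilled denoiser runs a small fixed number of steps.

```python
# Illustrative sketch (not the authors' code) of hybrid implicit conditioning
# followed by a short, fixed-length denoising trajectory.
import numpy as np

NUM_DENOISE_STEPS = 4  # "few-step" setting; the real count is a design choice


def encode_face_implicit(frame):
    """Stand-in for an implicit facial-expression encoder (returns a vector)."""
    return frame.mean(axis=(0, 1))  # toy global appearance descriptor


def encode_3d_keypoints(frame):
    """Stand-in for a 3D implicit-keypoint extractor (returns a vector)."""
    return frame.std(axis=(0, 1))   # toy geometric descriptor


def fuse_motion_signals(frame):
    """Concatenate both implicit signals into one image-level control vector."""
    return np.concatenate([encode_face_implicit(frame), encode_3d_keypoints(frame)])


def few_step_denoise(noisy, control, steps=NUM_DENOISE_STEPS):
    """Toy few-step denoiser: each step pulls the sample toward the control
    signal's mean, standing in for a distilled, shortened denoising trajectory."""
    x = noisy
    for _ in range(steps):
        x = x + (control.mean() - x) / steps
    return x


if __name__ == "__main__":
    driving_frame = np.random.rand(64, 64, 3)
    control = fuse_motion_signals(driving_frame)
    sample = few_step_denoise(np.random.randn(64, 64, 3), control)
    print("control dim:", control.shape[0], "| output mean:", round(sample.mean(), 3))
```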