
LPM 1.0: Video-based Character Performance Model

Ailing Zeng, Casper Yang, Chauncey Ge, Eddie Zhang, Garvey Xu, Gavin Lin, Gilbert Gu, Jeremy Pi, Leo Li, Mingyi Shi, Sheng Bi, Steven Tang, Thorn Hang, Tobey Guo, Vincent Li, Xin Tong, Yikang Li, Yuchen Sun, Yue Zhao, Yuhan Lu, Yuwei Li

2026-04-10


Summary

This paper introduces a new AI model, called LPM 1.0, that can create realistic and responsive video performances of a single person, specifically focusing on conversations. Think of it as a way to make digital characters feel more alive and engaging.

What's the problem?

Creating believable digital characters is hard. Existing AI models struggle to balance three key things: making the character expressive and emotive, responding in real time, and keeping the character’s identity consistent over a long conversation. It’s a trade-off – you can usually only maximize two of these at once, which the authors call the 'performance trilemma'. Conversations are especially challenging because characters need to react to what’s being said, listen, speak, and maintain their personality all at the same time.

What's the solution?

The researchers tackled this by first creating a large, high-quality dataset of audio and video recordings of people having conversations. They then built a powerful AI model, a 17-billion-parameter 'Diffusion Transformer' called Base LPM, to learn from this data. This model is very good at controlling the character’s movements and keeping their appearance consistent. To make it fast enough for real-time use, they 'distilled' Base LPM into a smaller, faster model called Online LPM, which generates video causally, chunk by chunk, rather than all at once. This allows LPM 1.0 to generate videos of a character listening to you and responding, all in real time, while keeping their identity intact.
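To make the pipeline concrete, here is a minimal toy sketch of what the streaming inference loop could look like. All names (`OnlineLPM`, `generate_chunk`, the listen/speak modes) are illustrative placeholders, not the paper's actual API; a real model would run a distilled causal Diffusion Transformer where this sketch just formats a frame description.

```python
# Hypothetical sketch of LPM 1.0's full-duplex streaming loop.
# Names and signatures are assumptions for illustration only.

from dataclasses import dataclass, field
from typing import List

@dataclass
class OnlineLPM:
    """Toy stand-in for the distilled causal streaming generator."""
    identity_refs: List[str]                           # identity-aware reference features
    history: List[str] = field(default_factory=list)   # causal context for identity stability

    def generate_chunk(self, audio_chunk: str, mode: str, prompt: str = "") -> str:
        # A real model would denoise a short video chunk conditioned on the
        # identity references, the audio, and prior chunks; here we just
        # build a placeholder frame description.
        frame = f"[{mode}] frames for '{audio_chunk}' (id={self.identity_refs[0]})"
        if prompt:
            frame += f" with motion '{prompt}'"
        self.history.append(frame)  # retain context so identity stays stable
        return frame

# Full duplex: listening video from user audio, speaking video from TTS audio,
# with an optional text prompt for motion control.
model = OnlineLPM(identity_refs=["char_001"])
out = [
    model.generate_chunk("user: hello", mode="listen"),
    model.generate_chunk("tts: hi there", mode="speak", prompt="smile"),
]
```

The key design point this sketch mirrors is causality: each chunk depends only on past chunks and the fixed identity references, which is what allows low-latency, effectively infinite-length generation.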

Why it matters?

This work is important because it opens up possibilities for more realistic and engaging virtual characters. It could be used to create better conversational agents (like advanced chatbots with a visual presence), more lifelike characters for live streaming, or more believable non-player characters (NPCs) in video games. The researchers also created a new benchmark, LPM-Bench, to help evaluate and improve future character performance models.

Abstract

Performance, the externalization of intent, emotion, and personality through visual, vocal, and temporal behavior, is what makes a character alive. Learning such performance from video is a promising alternative to traditional 3D pipelines. However, existing video models struggle to jointly achieve high expressiveness, real-time inference, and long-horizon identity stability, a tension we call the performance trilemma. Conversation is the most comprehensive performance scenario, as characters simultaneously speak, listen, react, and emote while maintaining identity over time. To address this, we present LPM 1.0 (Large Performance Model), focusing on single-person full-duplex audio-visual conversational performance. Concretely, we build a multimodal human-centric dataset through strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction; train a 17B-parameter Diffusion Transformer (Base LPM) for highly controllable, identity-consistent performance through multimodal conditioning; and distill it into a causal streaming generator (Online LPM) for low-latency, infinite-length interaction. At inference, given a character image with identity-aware references, LPM 1.0 generates listening videos from user audio and speaking videos from synthesized audio, with text prompts for motion control, all at real-time speed with identity-stable, infinite-length generation. LPM 1.0 thus serves as a visual engine for conversational agents, live streaming characters, and game NPCs. To systematically evaluate this setting, we propose LPM-Bench, the first benchmark for interactive character performance. LPM 1.0 achieves state-of-the-art results across all evaluated dimensions while maintaining real-time inference.