
KlingAvatar 2.0 Technical Report

Kling Team, Jialu Chen, Yikang Ding, Zhixue Fang, Kun Gai, Yuan Gao, Kang He, Jingyun Hua, Boyuan Jiang, Mingming Lao, Xiaohan Li, Hui Liu, Jiwen Liu, Xiaoqiang Liu, Yuan Liu, Shun Lu, Yongsen Mao, Yingchao Shao, Huafeng Shi, Xiaoyu Shi, Peiqin Sun, Songlin Tang

2025-12-16


Summary

This paper introduces KlingAvatar 2.0, a system for generating long, realistic videos of people (avatars) that follow a user's instructions.

What's the problem?

Existing avatar video generation models struggle to create long, high-quality videos. As a video gets longer, visual quality tends to degrade, motion drifts and becomes unnatural over time (temporal drifting), and the output stops faithfully following the given instructions. Maintaining consistency and realism across many seconds of footage is hard.

What's the solution?

KlingAvatar 2.0 tackles this in two stages. It first creates a rough, low-resolution 'blueprint' of the entire video to plan the key movements and overall story. It then refines this blueprint into high-resolution video, working on short segments one at a time and anchoring each segment on its first and last frames so that it flows smoothly into the next. To better understand what the user wants, the system uses three modality-specific 'expert' AI models that analyze the instructions and hold a multi-turn 'dialogue' to clarify the user's intent, turning the inputs into a detailed storyline. A separate component refines negative prompts, the things the avatar *shouldn't* do, to further improve instruction accuracy. Finally, the system can control multiple characters in a single video, each keeping their own distinct identity.
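The two-stage cascade can be illustrated with a toy sketch. Everything here is a stand-in: `blueprint_keyframes` and `refine_subclip` are hypothetical placeholders for the paper's (unreleased) models, and "frames" are just labels rather than images. The point is the structure: global low-resolution keyframes first, then high-resolution sub-clips that share boundary frames so adjacent segments stitch together smoothly.

```python
# Toy sketch of the spatio-temporal cascade (names are invented;
# the actual KlingAvatar 2.0 models are not public).

def blueprint_keyframes(prompt, n):
    """Stage 1 stand-in: low-res keyframes spanning the whole video,
    capturing global semantics and motion."""
    return [f"key[{prompt}:{i}]" for i in range(n)]

def refine_subclip(first, last, frames):
    """Stage 2 stand-in: a high-res sub-clip conditioned on its first
    and last frames, so adjacent clips share a boundary frame."""
    middle = [f"mid({first}->{last}:{t})" for t in range(frames - 2)]
    return [first] + middle + [last]

def generate_long_video(prompt, num_keyframes=4, clip_len=5):
    keys = blueprint_keyframes(prompt, num_keyframes)
    video = []
    for first, last in zip(keys[:-1], keys[1:]):
        clip = refine_subclip(first, last, clip_len)
        # Drop the duplicated boundary frame when stitching clips.
        video.extend(clip if not video else clip[1:])
    return video
```

Because each sub-clip ends on the exact frame the next one starts from, the concatenated video has no seams at segment boundaries, which is the property that keeps long outputs temporally coherent.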

Why it matters?

This research is important because it makes it possible to generate much longer and more realistic avatar videos than before. This has potential applications in areas like filmmaking, virtual reality, and personalized content creation, allowing for more engaging and immersive experiences.

Abstract

Avatar video generation models have achieved remarkable progress in recent years. However, prior work exhibits limited efficiency in generating long-duration high-resolution videos, suffering from temporal drifting, quality degradation, and weak prompt following as video length increases. To address these challenges, we propose KlingAvatar 2.0, a spatio-temporal cascade framework that performs upscaling in both spatial resolution and temporal dimension. The framework first generates low-resolution blueprint video keyframes that capture global semantics and motion, and then refines them into high-resolution, temporally coherent sub-clips using a first-last frame strategy, while retaining smooth temporal transitions in long-form videos. To enhance cross-modal instruction fusion and alignment in extended videos, we introduce a Co-Reasoning Director composed of three modality-specific large language model (LLM) experts. These experts reason about modality priorities and infer underlying user intent, converting inputs into detailed storylines through multi-turn dialogue. A Negative Director further refines negative prompts to improve instruction alignment. Building on these components, we extend the framework to support ID-specific multi-character control. Extensive experiments demonstrate that our model effectively addresses the challenges of efficient, multimodally aligned long-form high-resolution video generation, delivering enhanced visual clarity, realistic lip-teeth rendering with accurate lip synchronization, strong identity preservation, and coherent multimodal instruction following.
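The Co-Reasoning Director's multi-turn dialogue can be sketched as follows. All names here are invented for illustration, and `expert_revise` is a trivial stand-in for what would really be an LLM call: each modality-specific expert proposes an interpretation, reads the other experts' proposals, and revises over a fixed number of turns, after which the final views are merged into one storyline.

```python
# Hypothetical sketch of multi-turn co-reasoning among modality experts
# (names invented; real experts would be LLM calls).

def expert_revise(modality, own_input, peer_views):
    # Stand-in for an LLM call: fold peer context into this expert's view.
    return f"{modality} reads '{own_input}' given {sorted(peer_views)}"

def co_reasoning(inputs, turns=2):
    views = dict(inputs)  # initial readings: raw per-modality input
    for _ in range(turns):
        views = {
            m: expert_revise(m, inputs[m],
                             [v for p, v in views.items() if p != m])
            for m in views
        }
    # Merge the final per-modality views into a single storyline.
    return " | ".join(views[m] for m in sorted(views))
```

The dialogue lets each expert weigh its modality's priority against the others before committing, which is how the paper frames resolving conflicts between, say, a text instruction and an audio track.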