
Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis

Yikang Ding, Jiwen Liu, Wenyuan Zhang, Zekun Wang, Wentao Hu, Liyuan Cui, Mingming Lao, Yingchao Shao, Hui Liu, Xiaohan Li, Ming Chen, Xiaoqiang Liu, Yu-Shen Liu, Pengfei Wan

2025-09-12


Summary

This paper introduces a new system called Kling-Avatar that creates realistic videos of avatars speaking, driven by audio and text instructions.

What's the problem?

Current methods for making avatars speak in sync with an audio track are good at getting the lip movements right, but they don't really understand *what* the avatar is supposed to be saying or feeling. They treat the instructions as low-level cues to follow, matching sounds and visuals without considering the overall meaning or story, so the resulting videos feel robotic and lack emotion and coherence.

What's the solution?

The researchers developed Kling-Avatar, which works in two stages. First, a powerful AI 'director' (a multimodal large language model) analyzes the audio and text instructions and produces a 'blueprint' video that plans the avatar's movements and emotions. Then, a second stage uses keyframes from this blueprint to generate the final video, preserving fine details while staying faithful to the overall intent of the instructions. Because the video is split into smaller sub-clips, each anchored by blueprint keyframes, those clips can be generated at the same time, which makes long videos fast and reliable to produce.
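To make the two-stage design concrete, here is a minimal, hypothetical sketch of the cascade: an MLLM 'director' turns audio and text into a blueprint, and sub-clips bounded by blueprint keyframes are then generated in parallel. The names (`Blueprint`, `mllm_director`, `generate_subclip`) and the thread-pool setup are illustrative assumptions, not the authors' actual code.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical stand-ins for the paper's components; names and signatures
# are illustrative, not the authors' actual API.

@dataclass
class Blueprint:
    keyframes: list   # frames sketching the planned motion and expression
    notes: str        # high-level directions, e.g. "calm tone, light smile"

def mllm_director(audio_path: str, text_instructions: str) -> Blueprint:
    """Stage 1 (sketch): an MLLM 'director' turns audio + text into a
    blueprint video that fixes high-level semantics (motion, emotion)."""
    return Blueprint(keyframes=[f"kf_{i}" for i in range(5)],
                     notes="calm tone, light smile, nod at sentence ends")

def generate_subclip(first_kf, last_kf, audio_segment) -> str:
    """Stage 2 (sketch): render one sub-clip bounded by two blueprint
    keyframes (first-last frame strategy), conditioned on its audio slice."""
    return f"clip[{first_kf}->{last_kf}]"

def kling_avatar_like_pipeline(audio_path: str, text_instructions: str) -> list:
    bp = mllm_director(audio_path, text_instructions)
    # Consecutive keyframe pairs bound each sub-clip, so the clips can be
    # generated independently and in parallel.
    pairs = list(zip(bp.keyframes[:-1], bp.keyframes[1:]))
    with ThreadPoolExecutor() as pool:
        clips = list(pool.map(
            lambda p: generate_subclip(p[0], p[1], audio_segment=None), pairs))
    return clips  # concatenated in keyframe order to form the long video

if __name__ == "__main__":
    print(kling_avatar_like_pipeline("speech.wav", "greet viewers warmly"))
```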

Why it matters?

This work is important because it moves beyond simply making avatars *look* like they're talking to making them actually *communicate* effectively. This could be really useful for things like creating digital humans for livestreaming, vlogging, or other applications where a realistic and expressive virtual person is needed, and it sets a new standard for how well these avatars can perform.

Abstract

Recent advances in audio-driven avatar video generation have significantly enhanced audio-visual realism. However, existing methods treat instruction conditioning merely as low-level tracking driven by acoustic or visual cues, without modeling the communicative purpose conveyed by the instructions. This limitation compromises their narrative coherence and character expressiveness. To bridge this gap, we introduce Kling-Avatar, a novel cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation. Our approach adopts a two-stage pipeline. In the first stage, we design a multimodal large language model (MLLM) director that produces a blueprint video conditioned on diverse instruction signals, thereby governing high-level semantics such as character motion and emotions. In the second stage, guided by blueprint keyframes, we generate multiple sub-clips in parallel using a first-last frame strategy. This global-to-local framework preserves fine-grained details while faithfully encoding the high-level intent behind multimodal instructions. Our parallel architecture also enables fast and stable generation of long-duration videos, making it suitable for real-world applications such as digital human livestreaming and vlogging. To comprehensively evaluate our method, we construct a benchmark of 375 curated samples covering diverse instructions and challenging scenarios. Extensive experiments demonstrate that Kling-Avatar is capable of generating vivid, fluent, long-duration videos at up to 1080p and 48 fps, achieving superior performance in lip synchronization accuracy, emotion and dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization. These results establish Kling-Avatar as a new benchmark for semantically grounded, high-fidelity audio-driven avatar synthesis.
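The first-last frame strategy mentioned in the abstract can be pictured as pairing consecutive blueprint keyframes so that each sub-clip shares its boundary frames with its neighbors, which is what allows parallel generation without visible seams. The sketch below is an assumption-laden illustration of that scheduling step (the `Keyframe` and `SubclipJob` structures are invented for clarity), not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative sketch of the first-last frame strategy; Keyframe and
# SubclipJob are hypothetical structures, not the authors' code.

@dataclass
class Keyframe:
    t: float      # timestamp (seconds) within the blueprint video
    image: str    # placeholder for the keyframe image

@dataclass
class SubclipJob:
    first: Keyframe
    last: Keyframe
    audio_window: Tuple[float, float]

def plan_subclips(keyframes: List[Keyframe]) -> List[SubclipJob]:
    """Each sub-clip is anchored by two consecutive blueprint keyframes.
    Because the last frame of clip i equals the first frame of clip i+1,
    the clips can be generated in parallel and still stitch seamlessly."""
    return [SubclipJob(first=a, last=b, audio_window=(a.t, b.t))
            for a, b in zip(keyframes[:-1], keyframes[1:])]

if __name__ == "__main__":
    kfs = [Keyframe(t=2.0 * i, image=f"kf_{i}.png") for i in range(4)]
    for job in plan_subclips(kfs):
        print(job.first.image, "->", job.last.image, "audio", job.audio_window)
```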