HunyuanCustom introduces an image-text fusion module based on LLaVA that lets image and text features interact, so that identity information from the image is integrated into the textual description. It also proposes an image ID enhancement module, which concatenates image information along the temporal axis and uses the video model's temporal modeling ability to reinforce subject identity throughout the video. Together, these modules support high-quality video generation with precise control over image, audio, and video conditions.
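To make the temporal-axis concatenation concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not HunyuanCustom's actual code: the function name `prepend_identity_frame`, the (T, C, H, W) latent layout, and the choice to place the image before the video frames are all hypothetical.

```python
import numpy as np


def prepend_identity_frame(video_latent: np.ndarray, image_latent: np.ndarray) -> np.ndarray:
    """Concatenate a reference-image latent with video latents along time.

    Assumed shapes (hypothetical, for illustration only):
      video_latent: (T, C, H, W)
      image_latent: (C, H, W)
    Returns an array of shape (T + 1, C, H, W).
    """
    if image_latent.shape != video_latent.shape[1:]:
        raise ValueError("image latent must match the per-frame latent shape")
    # Treat the image as one extra "frame" so the temporal layers can attend to it.
    identity_frame = image_latent[np.newaxis, ...]
    # Assumption: the identity frame goes first; the source only says "along the temporal axis".
    return np.concatenate([identity_frame, video_latent], axis=0)


video = np.zeros((8, 4, 16, 16), dtype=np.float32)  # 8 latent frames
image = np.ones((4, 16, 16), dtype=np.float32)      # 1 reference-image latent
tokens = prepend_identity_frame(video, image)
print(tokens.shape)  # (9, 4, 16, 16)
```

Because the image occupies its own slot on the time axis, every video frame can reference it through the same temporal modeling the video model already applies between frames.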


HunyuanCustom also supports audio-driven and video-driven video customization, allowing for more flexible and controllable audio-driven human animation and video-driven video generation. The framework can replace an object in a video, or add one, using the subject identity given in a reference image. This makes it applicable to video editing, animation, and virtual reality.

Key Features

Multi-modal customized video generation
Supports image, audio, video, and text conditions
Image-text fusion module based on LLaVA
Image ID enhancement module for subject consistency
Audio-driven and video-driven video customization
Precise control over image, audio, and video conditions
High-quality video generation
Supports a wide range of applications in video editing, animation, and virtual reality
