HunyuanCustom introduces an image-text fusion module based on LLaVA that lets images and text interact, so that identity information from an image is effectively integrated into the textual description. It also proposes an image ID enhancement module, which concatenates the image information along the temporal axis and leverages the video model's strong temporal modeling capability to reinforce the subject's identity throughout the video. Together, these modules enable high-quality video generation with precise control over image, audio, and video conditions.
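The temporal concatenation behind the image ID enhancement module can be illustrated with a minimal sketch. This is not HunyuanCustom's implementation: the function name `concat_identity_along_time`, the `(T, C, H, W)` latent layout, and the choice to place the image at time index 0 are all assumptions made for illustration. The text above only states that image information is concatenated along the temporal axis.

```python
import numpy as np


def concat_identity_along_time(video_latents, image_latent):
    """Treat the reference image as one extra frame of the video.

    video_latents: array of shape (T, C, H, W), assumed layout.
    image_latent:  array of shape (C, H, W), same channels and size.
    Returns an array of shape (T + 1, C, H, W).
    """
    if video_latents.shape[1:] != image_latent.shape:
        raise ValueError(
            f"image latent {image_latent.shape} does not match "
            f"per-frame shape {video_latents.shape[1:]}"
        )
    # Add a leading time axis so the image becomes a single-frame clip.
    identity_frame = image_latent[np.newaxis, ...]
    # Assumption: the identity frame is placed first (time index 0).
    return np.concatenate([identity_frame, video_latents], axis=0)


# Hypothetical sizes, chosen only to keep the example small.
T, C, H, W = 8, 4, 16, 16
rng = np.random.default_rng(0)
video_latents = rng.standard_normal((T, C, H, W))
image_latent = rng.standard_normal((C, H, W))

fused = concat_identity_along_time(video_latents, image_latent)
print(fused.shape)  # (9, 4, 16, 16)
```

Because the image sits on the same axis as the frames, a model that attends across time can read identity cues from it at every frame without a separate conditioning pathway, which is the intuition the paragraph above describes.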
HunyuanCustom also supports audio-driven and video-driven customization, enabling more flexible and controllable audio-driven human animation and video-driven video generation. Given a subject ID specified by an image, the framework can replace an existing object in a video with that subject or add the subject to the video, which enables a wide range of applications in video editing, animation, and virtual reality.