HunyuanVideo 1.5 Technical Report
Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Jack Peng, Jianbing Wu, Jiangfeng Xiong, Jie Jiang, Linus, Patrol, Peizhen Zhang, Peng Chen, Penghao Zhao, Qi Tian, Songtao Liu, Weijie Kong, Weiyan Wang, Xiao He, Xin Li, Xinchi Deng
2025-11-25
Summary
This paper introduces HunyuanVideo 1.5, a new open-source model for creating videos from text or images. It is designed to be compact and efficient, so it can run on consumer-grade GPUs rather than requiring large compute clusters, while still producing high-quality videos.
What's the problem?
Creating realistic and coherent videos from text or images is a hard problem for computers. Existing models often require massive amounts of computing power and data, putting them out of reach for many researchers and creators. There was a need for a video generation model that could achieve strong results without huge resources.
What's the solution?
The researchers tackled this by carefully curating the training data and building on a diffusion transformer (DiT) architecture equipped with a new selective and sliding tile attention (SSTA) mechanism. They also strengthened the model's bilingual text understanding with glyph-aware text encoding, which helps with characters and symbols, and trained it through progressive pre-training and post-training stages. Finally, they added an efficient video super-resolution network to upscale the output for better quality. All of these pieces work together in a single system that generates videos from either text or images.
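To make the attention idea concrete, here is a minimal sketch of how a sliding-tile attention mask over a spatiotemporal latent token grid might be built. The summary only names SSTA; the tile size, window radius, and token layout below are illustrative assumptions, not the actual HunyuanVideo 1.5 implementation.

```python
# Minimal sketch of a sliding tile attention mask, assuming a
# spatiotemporal latent token grid. Tile size, window radius, and
# token layout are illustrative assumptions, not the actual
# HunyuanVideo 1.5 SSTA implementation.
import torch

def sliding_tile_mask(t: int, h: int, w: int, tile: int = 4, window: int = 1) -> torch.Tensor:
    """Boolean (N, N) mask over N = t*h*w tokens: a query attends to a key
    only if their tile indices differ by at most `window` tiles along every
    axis (frame, row, column)."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(t), torch.arange(h), torch.arange(w), indexing="ij"
    ), dim=-1).reshape(-1, 3)                  # (N, 3) token coordinates
    tiles = coords // tile                     # coarse tile index per token
    diff = (tiles[:, None, :] - tiles[None, :, :]).abs()
    return (diff <= window).all(dim=-1)        # True where attention is kept

# Example: a 4-frame, 16x16 latent grid; the resulting mask can be passed
# to torch.nn.functional.scaled_dot_product_attention via attn_mask.
mask = sliding_tile_mask(t=4, h=16, w=16)      # shape (1024, 1024)
```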
Why it matters?
HunyuanVideo 1.5 is important because it makes high-quality video generation more accessible. By releasing the model and its code publicly, the researchers are allowing more people to experiment with and build upon this technology, potentially leading to new creative tools and research advancements in the field of AI-powered video creation.
Abstract
We present HunyuanVideo 1.5, a lightweight yet powerful open-source video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. This achievement is built upon several key components, including meticulous data curation, an advanced DiT architecture featuring selective and sliding tile attention (SSTA), enhanced bilingual understanding through glyph-aware text encoding, progressive pre-training and post-training, and an efficient video super-resolution network. Leveraging these designs, we developed a unified framework capable of high-quality text-to-video and image-to-video generation across multiple durations and resolutions. Extensive experiments demonstrate that this compact and proficient model establishes a new state-of-the-art among open-source video generation models. By releasing the code and model weights, we provide the community with a high-performance foundation that lowers the barrier to video creation and research, making advanced video generation accessible to a broader audience. All open-source assets are publicly available at https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5.
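For orientation, the following is a hedged sketch of the two-stage inference flow the abstract describes: base DiT generation followed by video super-resolution. The objects `base_dit` and `sr_net` and their methods are hypothetical placeholders, not the model's real API; the actual interfaces are in the repository linked above.

```python
# Hedged sketch of the two-stage inference flow from the abstract:
# base DiT generation, then video super-resolution. `base_dit`,
# `sr_net`, and their methods are hypothetical placeholders, not the
# real API; see https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5
# for actual interfaces.
import torch

@torch.no_grad()
def generate(prompt: str, base_dit, sr_net, num_frames: int, resolution: tuple):
    # Stage 1: the 8.3B-parameter DiT denoises a latent video clip
    # conditioned on the (glyph-aware) text encoding of `prompt`.
    latents = base_dit.sample(prompt, num_frames=num_frames, resolution=resolution)
    video = base_dit.decode(latents)       # low-resolution RGB frames
    # Stage 2: an efficient super-resolution network upscales the frames,
    # keeping end-to-end inference feasible on consumer-grade GPUs.
    return sr_net(video)                   # upscaled output video
```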