
InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption

Tiehan Fan, Kepan Nan, Rui Xie, Penghao Zhou, Zhenheng Yang, Chaoyou Fu, Xiang Li, Jian Yang, Ying Tai

2024-12-16


Summary

This paper introduces InstanceCap, a new framework that improves text-to-video generation by producing more detailed and accurate captions for the videos used in training.

What's the problem?

Text-to-video models are trained on video-caption pairs, but current video captions often lack detail, contain hallucinated content, and describe motion imprecisely. Models trained on such captions produce videos that don't faithfully match the text prompt, with inaccurate details and unrealistic movements.

What's the solution?

InstanceCap creates detailed captions at the instance level, meaning each caption describes the specific objects and actions in the video rather than just the overall scene. It uses a cluster of auxiliary models to break a video down into individual instances, which improves the fidelity of the resulting captions, and it refines dense prompts into structured phrases that are concise yet precise. The authors also curated InstanceVid, a dataset of 22,000 video samples, to train the system, along with an enhancement pipeline tailored to the InstanceCap caption structure for use at inference time. By structuring captions this way and ensuring they accurately describe the video content, InstanceCap significantly improves the quality of generated videos.
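To make the idea of an instance-level structured caption concrete, here is a minimal sketch in Python. The field names (`category`, `appearance`, `motion`) and the caption template are illustrative assumptions, not the authors' actual format or code; the point is only to show how per-instance descriptions can be assembled into one structured prompt.

```python
# Hypothetical sketch of assembling an instance-aware structured caption.
# Field names and the output template are illustrative, not the paper's API.

def build_structured_caption(global_desc, instances):
    """Combine a global scene description with per-instance phrases."""
    parts = [f"Scene: {global_desc}"]
    for inst in instances:
        # Each instance contributes a concise phrase: what it is,
        # how it looks, and how it moves.
        parts.append(f"{inst['category']} ({inst['appearance']}) {inst['motion']}")
    return "; ".join(parts)

instances = [
    {"category": "a red car", "appearance": "glossy, compact",
     "motion": "drives from left to right"},
    {"category": "a cyclist", "appearance": "yellow jersey",
     "motion": "follows behind the car"},
]

caption = build_structured_caption("a sunny city street at noon", instances)
print(caption)
```

In the paper, the per-instance information would come from the auxiliary model cluster (detection, tracking, and description models) rather than being written by hand as it is here.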

Why it matters?

This research is important because it enhances the ability of AI to create realistic and coherent videos based on text prompts. By providing more accurate and detailed captions, InstanceCap can lead to better applications in fields like entertainment, education, and marketing, where clear communication through video is essential.

Abstract

Text-to-video generation has evolved rapidly in recent years, delivering remarkable results. Training typically relies on video-caption paired data, which plays a crucial role in enhancing generation performance. However, current video captions often suffer from insufficient detail, hallucinations, and imprecise motion depiction, affecting the fidelity and consistency of generated videos. In this work, we propose a novel instance-aware structured caption framework, termed InstanceCap, to achieve instance-level and fine-grained video captioning for the first time. Based on this scheme, we design an auxiliary model cluster that converts the original video into instances to enhance instance fidelity. Video instances are further used to refine dense prompts into structured phrases, achieving concise yet precise descriptions. Furthermore, a 22K InstanceVid dataset is curated for training, and an enhancement pipeline tailored to the InstanceCap structure is proposed for inference. Experimental results demonstrate that our proposed InstanceCap significantly outperforms previous models, ensuring high fidelity between captions and videos while reducing hallucinations.