Motif-Video 2B: Technical Report
Junghwan Lim, Wai Ting Cheung, Minsu Ha, Beomgyu Kim, Taewhan Kim, Haesol Lee, Dongpin Oh, Jeesoo Lee, Taehyun Kim, Minjae Kim, Sungmin Lee, Hyeyeon Cho, Dahye Choi, Jaeheui Her, Jaeyeon Huh, Hanbin Jung, Changjin Kang, Dongseok Kim, Jangwoong Kim, Youngrok Kim, Hyukjin Kweon, Hongjoo Lee
2026-04-20
Summary
This paper explores whether it's possible to create high-quality videos from text descriptions without the enormous amounts of data and computing power that usually plague video generation models.
What's the problem?
Currently, building good text-to-video models requires huge datasets, models with billions of parameters, and a ton of processing time. The challenge is that getting the video to accurately reflect the text, stay consistent throughout, and show fine details all compete with each other when handled in the same way. Simply making the model bigger doesn't necessarily solve these issues efficiently.
What's the solution?
The researchers developed a model called Motif-Video 2B that tackles this by cleverly organizing the model's structure. Instead of just increasing size, they separated the tasks of understanding the text prompt, maintaining consistency over time, and adding detailed visuals into different parts of the model. They also used a technique called 'shared cross-attention' to help the model stay focused on the text even with long video sequences, and a special training method to make the most of limited computing resources.
Why it matters?
This work is important because it demonstrates that you can achieve impressive video quality with a much smaller and more efficient model. Motif-Video 2B outperforms a much larger model (Wan2.1 14B) while using significantly fewer resources, suggesting that smart design and training techniques can be just as, or even more, effective than simply scaling up model size. This could make high-quality video generation more accessible and affordable.
Abstract
Training strong video generation models usually requires massive datasets, large parameter counts, and substantial compute. In this work, we ask whether strong text-to-video quality is possible at a much smaller budget: fewer than 10M clips and fewer than 100,000 H200 GPU hours. Our core claim is that part of the answer lies in how model capacity is organized, not only in how much of it is used. In video generation, prompt alignment, temporal consistency, and fine-detail recovery can interfere with one another when they are handled through the same pathway. Motif-Video 2B addresses this by separating these roles architecturally, rather than relying on scale alone. The model combines two key ideas. First, Shared Cross-Attention strengthens text control when video token sequences become long. Second, a three-part backbone separates early fusion, joint representation learning, and detail refinement. To make this design effective under a limited compute budget, we pair it with an efficient training recipe based on dynamic token routing and early-phase feature alignment to a frozen pretrained video encoder. Our analysis shows that later blocks develop clearer cross-frame attention structure than standard single-stream baselines. On VBench, Motif-Video 2B reaches 83.76%, surpassing Wan2.1 14B while using 7× fewer parameters and substantially less training data. These results suggest that careful architectural specialization, combined with an efficiency-oriented training recipe, can close, and even reverse, the quality gap typically associated with much larger video models.
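To make the Shared Cross-Attention idea concrete, the sketch below shows one plausible reading of it: the text prompt is projected into keys and values once, and every block reuses those shared projections while keeping its own queries over the (much longer) video token sequence. This is a minimal NumPy illustration under our own assumptions; all dimensions, weight names, and the single-head form are illustrative and not taken from the report.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 64          # model width (illustrative)
n_text = 16     # text tokens from the prompt encoder
n_video = 1024  # video tokens: long sequences are the hard case
n_blocks = 4    # blocks that all reuse the shared text projections

# Project the text into keys/values ONCE; these are shared by every block,
# so prompt conditioning does not have to be recomputed per block.
W_k = rng.standard_normal((d, d)) / np.sqrt(d)
W_v = rng.standard_normal((d, d)) / np.sqrt(d)
text = rng.standard_normal((n_text, d))
K_shared, V_shared = text @ W_k, text @ W_v

video = rng.standard_normal((n_video, d))
for _ in range(n_blocks):
    # Each block keeps its own query projection but attends to the SAME
    # shared text keys/values.
    W_q = rng.standard_normal((d, d)) / np.sqrt(d)
    Q = video @ W_q
    attn = softmax(Q @ K_shared.T / np.sqrt(d))  # (n_video, n_text)
    video = video + attn @ V_shared              # residual cross-attn update

print(video.shape)
```

The point of the sketch is the asymmetry: per-block query projections let each stage ask different questions of the prompt, while the single shared key/value projection keeps the text signal consistent across depth even as the video sequence grows.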