MAGREF: Masked Guidance for Any-Reference Video Generation
Yufan Deng, Xun Guo, Yuanyang Yin, Jacob Zhiyuan Fang, Yiding Yang, Yizhi Wang, Shenghai Yuan, Angtian Wang, Bo Liu, Haibin Huang, Chongyang Ma
2025-05-30
Summary
This paper introduces MAGREF, a system that enables AI to generate videos combining multiple people or objects drawn from different sources, producing smooth and realistic results even when the instructions mix reference images with written text.
What's the problem?
Generating videos with several distinct subjects, especially when those subjects come from separate images or descriptions, often produces messy or unrealistic results: the AI struggles to blend the subjects into a single coherent scene with consistent identities and natural motion.
What's the solution?
The researchers developed MAGREF, which relies on two techniques: masked guidance and dynamic masking. In effect, the AI focuses on the right parts of each reference image or text prompt at the right time, so that all the subjects fit together in the generated video and move smoothly.
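To make the idea of masked guidance concrete, here is a minimal, simplified sketch: per-subject binary masks decide which spatial regions of a frame are conditioned on which reference's features. This is an illustrative toy, not MAGREF's actual mechanism (which operates inside a diffusion model); the function and variable names are hypothetical.

```python
import numpy as np

def masked_guidance(latent, ref_features, masks):
    """Toy illustration of masked guidance (hypothetical simplification):
    each reference feature map is written into the video latent only
    where its binary region mask is active, so every subject influences
    just its own part of the frame."""
    out = latent.copy()
    for feat, mask in zip(ref_features, masks):
        # feat: (C, H, W) reference features; mask: (H, W) binary region
        out = np.where(mask[None, :, :] > 0, feat, out)
    return out

# Toy example: two subjects occupying the left and right halves of an 8x8 frame.
H = W = 8
latent = np.zeros((3, H, W))
ref_a = np.full((3, H, W), 1.0)   # features for subject A
ref_b = np.full((3, H, W), 2.0)   # features for subject B
mask_a = np.zeros((H, W)); mask_a[:, : W // 2] = 1   # left half
mask_b = np.zeros((H, W)); mask_b[:, W // 2 :] = 1   # right half

guided = masked_guidance(latent, [ref_a, ref_b], [mask_a, mask_b])
```

In the toy output, the left half of every channel carries subject A's features and the right half subject B's, mirroring how region-aware masks keep each subject's identity localized; MAGREF's dynamic masking additionally adapts such regions over time and across denoising steps.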
Why does it matter?
This matters because it enables much more creative and higher-quality AI-generated video, which is useful for film, advertising, social media, and any project that needs to combine different people or objects into a single believable video.
Abstract
MAGREF is a unified framework for video generation that uses masked guidance and dynamic masking for coherent multi-subject synthesis from diverse references and text prompts.