Scaling Zero-Shot Reference-to-Video Generation
Zijian Zhou, Shikun Liu, Haozhe Liu, Haonan Qiu, Zhaochong An, Weiming Ren, Zhiheng Liu, Xiaoke Huang, Kam Woh Ng, Tian Xie, Xiao Han, Yuren Cong, Hang Li, Chuyan Zhu, Aditya Patel, Tao Xiang, Sen He
2025-12-09
Summary
This paper introduces Saber, a method for generating videos from text descriptions while ensuring the video features a specific person or object shown in one or more reference images. Essentially, it turns text and a picture into a short video that matches both.
What's the problem?
Currently, making these kinds of videos requires specialized training data: triplets that link reference images, videos, and text descriptions together. Building these triplets is expensive and time-consuming, which makes it hard to scale reference-to-video systems. Existing methods struggle without this carefully curated, labeled data.
What's the solution?
Saber sidesteps this requirement by training only on videos paired with text descriptions, with no reference image-video-text triplets. It uses a masked training strategy: parts of each video are hidden during training, which forces the model to connect the text to the visible content and to keep the subject's identity consistent, so a reference image can play the role of the visible content at inference time. The authors also add mask augmentation techniques to reduce "copy-paste" artifacts, where parts of the reference image are pasted into the video almost unchanged.
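To make the idea concrete, here is a minimal, hypothetical sketch of mask-based training on video-text pairs. The model, its input signature, and the masking schedule are assumptions for illustration only; this summary does not specify Saber's actual architecture or masking details.

```python
# Hypothetical sketch of masked training on video-text pairs (no R2V triplets).
import torch

def make_frame_mask(batch, frames, p_hide=0.5, device="cpu"):
    """Randomly hide whole frames: 1 = visible, 0 = hidden."""
    return (torch.rand(batch, frames, 1, 1, 1, device=device) > p_hide).float()

def masked_training_step(model, video, text_emb, optimizer):
    """One training step: the model sees only the unmasked frames plus the text,
    and is supervised to reconstruct the hidden frames, which encourages
    identity-consistent, reference-aware representations."""
    # video: (batch, frames, channels, height, width)
    mask = make_frame_mask(video.size(0), video.size(1), device=video.device)
    visible = video * mask                       # visible frames act as "references"
    pred = model(visible, mask, text_emb)        # hypothetical model signature
    loss = ((pred - video) ** 2 * (1 - mask)).mean()  # errors counted on hidden frames only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this reading, the reference image(s) would take the place of the visible frames at inference time, which is what lets the same model be used zero-shot for reference-to-video generation.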
Why it matters?
Because Saber does not rely on expensive triplet data, it is much easier and cheaper to scale. Despite this, it generalizes across different numbers of reference images and outperforms methods that do use specialized R2V data on the OpenS2V-Eval benchmark. This opens the door to more accessible and powerful reference-to-video generation tools.
Abstract
Reference-to-video (R2V) generation aims to synthesize videos that align with a text prompt while preserving the subject identity from reference images. However, current R2V methods are hindered by the reliance on explicit reference image-video-text triplets, whose construction is highly expensive and difficult to scale. We bypass this bottleneck by introducing Saber, a scalable zero-shot framework that requires no explicit R2V data. Trained exclusively on video-text pairs, Saber employs a masked training strategy and a tailored attention-based model design to learn identity-consistent and reference-aware representations. Mask augmentation techniques are further integrated to mitigate copy-paste artifacts common in reference-to-video generation. Moreover, Saber demonstrates remarkable generalization capabilities across a varying number of references and achieves superior performance on the OpenS2V-Eval benchmark compared to methods trained with R2V data.
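The abstract also mentions mask augmentation as a way to mitigate copy-paste artifacts. Below is a minimal, hypothetical illustration of that general idea: perturbing a binary subject mask (random dilation and translation) so its shape no longer lines up exactly with the target, which discourages verbatim pixel copying. The function name, kernel sizes, and shift ranges are illustrative assumptions, not Saber's actual augmentations.

```python
# Hypothetical mask augmentation: jitter the binary subject mask so it no
# longer matches the target region exactly.
import torch
import torch.nn.functional as F

def augment_mask(mask: torch.Tensor, max_dilate: int = 7, max_shift: int = 8) -> torch.Tensor:
    """mask: (batch, 1, height, width) binary tensor; returns a perturbed binary mask."""
    # Random dilation via an odd-sized max-pool kernel.
    k = 2 * int(torch.randint(0, max_dilate // 2 + 1, (1,))) + 1
    mask = F.max_pool2d(mask, kernel_size=k, stride=1, padding=k // 2)
    # Random spatial shift.
    dx, dy = (int(torch.randint(-max_shift, max_shift + 1, (1,))) for _ in range(2))
    mask = torch.roll(mask, shifts=(dy, dx), dims=(-2, -1))
    return (mask > 0.5).float()
```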