Vision Bridge Transformer at Scale

Zhenxiong Tan, Zeqing Wang, Xingyi Yang, Songhua Liu, Xinchao Wang

2025-12-01

Summary

This paper introduces a new type of model called Vision Bridge Transformer, or ViBT, which is designed to transform images and videos according to instructions. It takes a different approach from many existing models, which generate images from scratch.

What's the problem?

Current methods for changing images or videos, like diffusion models, can be slow and inefficient because they start with random noise and gradually refine it into the desired output. This process takes time and a lot of computing power. The challenge was to find a faster and more direct way to translate one image or video into another.

What's the solution?

The researchers created ViBT, which doesn't start from noise. Instead, it directly learns the 'pathway', or transformation, needed to go from an input image or video to the desired output. They built large versions of this model at 1.3 billion and 20 billion parameters, using a Transformer architecture – a common design in modern AI – to handle the complexity. They also developed a new training objective to make learning at this scale more stable and reliable.
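To make the "pathway" idea concrete, here is a minimal sketch of one training step for a Brownian Bridge Model. The interpolant (pinned at the input for t=0 and the output for t=1) and the simple straight-line velocity target are standard bridge-model choices, not the paper's exact variance-stabilized objective, and `predict_velocity` is a hypothetical stand-in for the actual Transformer:

```python
import numpy as np

def bridge_training_step(predict_velocity, x0, x1, sigma=1.0, rng=None):
    """One sketched training step for a Brownian bridge from x0 to x1.

    x0: input (e.g. source image latents), shape (batch, ...)
    x1: desired output (e.g. edited image latents), same shape
    predict_velocity(xt, t): hypothetical model predicting the velocity at xt.
    Note: ViBT's variance-stabilized velocity-matching objective is a
    refinement of this; the loss below is the plain version.
    """
    rng = np.random.default_rng() if rng is None else rng
    # One random time per sample, broadcastable over the data dimensions.
    t = rng.uniform(size=(x0.shape[0],) + (1,) * (x0.ndim - 1))
    noise = rng.standard_normal(x0.shape)
    # Brownian bridge sample: equals x0 at t=0 and x1 at t=1,
    # with noise variance t * (1 - t) in between.
    xt = (1 - t) * x0 + t * x1 + sigma * np.sqrt(t * (1 - t)) * noise
    # Straight-line velocity target for the data-to-data trajectory.
    v_target = x1 - x0
    v_pred = predict_velocity(xt, t)
    return float(np.mean((v_pred - v_target) ** 2))
```

The key contrast with diffusion training is that `x0` here is the conditioning input itself rather than pure noise, so sampling can start from the input and move directly toward the output.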

Why does it matter?

This work shows that Bridge Models, when scaled up significantly, can be very effective for tasks like editing images based on text instructions and creating complex video transformations. This could lead to faster and more efficient tools for image and video editing, potentially impacting fields like content creation and visual effects.

Abstract

We introduce Vision Bridge Transformer (ViBT), a large-scale instantiation of Brownian Bridge Models designed for conditional generation. Unlike traditional diffusion models that transform noise into data, Bridge Models directly model the trajectory between inputs and outputs, creating an efficient data-to-data translation paradigm. By scaling these models to 20B and 1.3B parameters, we demonstrate their effectiveness for image and video translation tasks. To support this scale, we adopt a Transformer architecture and propose a variance-stabilized velocity-matching objective for robust training. Together, these advances highlight the power of scaling Bridge Models for instruction-based image editing and complex video translation.