
BLIP3-o: A Family of Fully Open Unified Multimodal Models - Architecture, Training and Dataset

Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, Ran Xu

2025-05-15

Summary

This paper introduces BLIP3-o, a new family of AI models that can both understand and create images using a design called a diffusion transformer. Every detail of how the models work and the data they are trained on is fully open to the public.

What's the problem?

Most AI models are good at either understanding images or generating them, but not both at the same time. It is also rare for all of a model's technical details and training data to be available for everyone to use and learn from.

What's the solution?

The researchers built a unified model framework that combines the image-generation strengths of diffusion transformers with strong image-understanding abilities. They also released the model's design, training process, and dataset completely openly, so anyone can study, use, or improve it.
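To make the general pattern concrete, here is a minimal PyTorch sketch of the kind of unified design the paper describes: one shared backbone provides understanding features, and a diffusion-transformer head generates images conditioned on those features. This is not BLIP3-o's actual architecture or code; the class name, layer sizes, timestep conditioning, and the toy 4-step sampler are all illustrative assumptions.

```python
# A minimal sketch of the unified pattern described above, NOT BLIP3-o's
# actual code: a shared transformer backbone provides understanding
# features, and a diffusion-transformer head denoises image latents
# conditioned on those features. All names and sizes are assumptions.
import torch
import torch.nn as nn


class UnifiedMultimodalSketch(nn.Module):
    def __init__(self, dim=256, vocab=1000, n_heads=4, n_layers=2):
        super().__init__()
        # Shared backbone over text tokens (stands in for the multimodal
        # language model that does image understanding).
        self.token_emb = nn.Embedding(vocab, dim)
        enc = nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc, n_layers)
        # Diffusion-transformer head: attends to backbone features via
        # cross-attention while denoising noisy image latents.
        dec = nn.TransformerDecoderLayer(dim, n_heads, dim * 4, batch_first=True)
        self.dit_head = nn.TransformerDecoder(dec, n_layers)
        self.time_emb = nn.Linear(1, dim)  # toy timestep embedding

    def understand(self, text_ids):
        # Understanding path: encode a token sequence with the backbone.
        return self.backbone(self.token_emb(text_ids))

    def denoise_step(self, noisy_latents, t, cond):
        # One step of the generation path: predict the noise to remove,
        # given the timestep t and the conditioning features `cond`.
        t_feat = self.time_emb(t.view(-1, 1, 1).float())
        return self.dit_head(noisy_latents + t_feat, cond)


if __name__ == "__main__":
    model = UnifiedMultimodalSketch()
    prompt = torch.randint(0, 1000, (1, 16))   # fake text prompt
    cond = model.understand(prompt)            # understanding features
    latents = torch.randn(1, 64, 256)          # random "noisy image" latents
    for t in reversed(range(4)):               # toy 4-step sampler
        pred = model.denoise_step(latents, torch.tensor([t]), cond)
        latents = latents - 0.25 * pred        # crude denoising update
    print("generated latent shape:", latents.shape)  # (1, 64, 256)
```

The point the sketch captures is that understanding and generation share one set of features, so a single model can handle both kinds of task rather than needing two separate systems.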

Why it matters?

This matters because it pushes AI to be more versatile, letting one model handle a wider range of tasks involving images. And because everything is fully open, students, researchers, and developers everywhere can learn from this work and build on it.

Abstract

A diffusion transformer is used in a unified multimodal model framework to improve image generation while maintaining image understanding capabilities.