BLIP3o-NEXT: Next Frontier of Native Image Generation
Jiuhai Chen, Le Xue, Zhiyang Xu, Xichen Pan, Shusheng Yang, Can Qin, An Yan, Honglu Zhou, Zeyuan Chen, Lifu Huang, Tianyi Zhou, Junnan Li, Silvio Savarese, Caiming Xiong, Ran Xu
2025-10-20
Summary
This paper introduces BLIP3o-NEXT, a fully open-source AI model that excels at both creating images from text and editing existing images, all within one system.
What's the problem?
Creating realistic and coherent images from text descriptions is hard, and editing images based on instructions is even harder. Existing models often struggle either to render fine details or to truly understand what edits a user wants. It also wasn't entirely clear which aspects of a model's design matter *most* for good image generation.
What's the solution?
The researchers found that the exact architecture matters less than whether it scales efficiently and supports fast inference. They combined two types of AI model: one that is good at understanding instructions and reasoning (an autoregressive model) and one that excels at rendering detailed images (a diffusion model). The first model generates a basic 'blueprint' for the image as a sequence of discrete image tokens, and the second model fills in the details, using the first model's hidden states as its conditioning signal. They also improved the model by training it on better data and by using reinforcement learning to refine its image-generation skills.
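The two-stage pipeline described above can be sketched as toy code. Everything here is an illustrative assumption, not the paper's actual implementation: the shapes, the greedy decoding, and the simplified "denoising" update are stand-ins that only show how the autoregressive stage's hidden states flow into the diffusion stage as conditioning.

```python
import numpy as np

# Toy sketch of an Autoregressive + Diffusion pipeline (all names,
# dimensions, and update rules are illustrative assumptions).

rng = np.random.default_rng(0)

VOCAB = 16    # size of the discrete image-token vocabulary (assumed)
HIDDEN = 8    # autoregressive hidden-state width (assumed)
N_TOKENS = 4  # number of image tokens to generate (assumed)

# Fixed random projections standing in for trained weights.
W_step = rng.normal(size=(HIDDEN, HIDDEN)) / np.sqrt(HIDDEN)
W_out = rng.normal(size=(HIDDEN, VOCAB)) / np.sqrt(HIDDEN)

def autoregressive_stage(prompt_embedding):
    """Emit discrete image tokens one by one and collect hidden states."""
    tokens, hiddens = [], []
    h = prompt_embedding.copy()
    for _ in range(N_TOKENS):
        h = np.tanh(h @ W_step)       # stand-in for one transformer step
        logits = h @ W_out            # project hidden state onto vocabulary
        tokens.append(int(np.argmax(logits)))  # greedy decoding
        hiddens.append(h.copy())
    # The hidden states, not the tokens, become the conditioning signal.
    return tokens, np.stack(hiddens)

def diffusion_stage(conditioning, steps=10):
    """Iteratively 'denoise' toward an output guided by the conditioning."""
    x = rng.normal(size=HIDDEN)          # start from pure noise
    target = conditioning.mean(axis=0)   # toy pooled conditioning signal
    for _ in range(steps):
        x = x + 0.5 * (target - x)       # toy denoising update
    return x

prompt = rng.normal(size=HIDDEN)  # stand-in for encoded multimodal inputs
tokens, hiddens = autoregressive_stage(prompt)
image = diffusion_stage(hiddens)
```

The split mirrors the division of labor in the summary: the autoregressive loop handles sequential, instruction-conditioned decisions, while the iterative refinement loop handles rendering from noise under that guidance.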
Why it matters?
This work matters because it provides a powerful, freely available tool for generating and editing images with AI. By identifying what really drives image-generation quality, it paves the way for better and more accessible AI image tools, and the combined autoregressive-plus-diffusion approach produces more realistic and consistent results than previous methods.
Abstract
We present BLIP3o-NEXT, a fully open-source foundation model in the BLIP3 series that advances the next frontier of native image generation. BLIP3o-NEXT unifies text-to-image generation and image editing within a single architecture, demonstrating strong capabilities in both. In developing this state-of-the-art native image generation model, we identify four key insights: (1) Most architectural choices yield comparable performance; an architecture can be deemed effective provided it scales efficiently and supports fast inference; (2) The successful application of reinforcement learning can further push the frontier of native image generation; (3) Image editing remains a challenging task, yet instruction following and the consistency between generated and reference images can be significantly enhanced through post-training and a data engine; (4) Data quality and scale continue to be the decisive factors that determine the upper bound of model performance. Building upon these insights, BLIP3o-NEXT adopts an Autoregressive + Diffusion architecture in which an autoregressive model first generates discrete image tokens conditioned on multimodal inputs, and its hidden states are then used as conditioning signals for a diffusion model that generates high-fidelity images. This architecture integrates the reasoning strength and instruction following of autoregressive models with the fine-detail rendering ability of diffusion models, achieving a new level of coherence and realism. Extensive evaluations on various text-to-image and image-editing benchmarks show that BLIP3o-NEXT achieves superior performance over existing models.