LightBagel: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation
Zeyu Wang, Zilong Chen, Chenhui Gou, Feng Li, Chaorui Deng, Deyao Zhu, Kunchang Li, Weihao Yu, Haoqin Tu, Haoqi Fan, Cihang Xie
2025-10-28
Summary
This paper introduces a new way to build powerful AI models that can work with different types of data, like text and images, without needing to train them completely from scratch.
What's the problem?
Creating these advanced AI models that handle multiple types of data is usually incredibly expensive and requires a huge amount of computing power and data because they are typically built from the ground up. This limits who can develop and use these technologies.
What's the solution?
The researchers found a way to combine existing, already-trained AI models – some good at understanding information and others good at creating it – and connect them using a technique called multimodal self-attention. This technique lets the models share information effectively without losing the strengths they already had. They only needed to train the combined model on a relatively small amount of data, about 35 billion tokens, to achieve impressive results.
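The core idea can be sketched in a few lines: keep each pretrained branch's original blocks untouched, and interleave extra self-attention blocks that operate over the concatenated token streams from both branches. The sketch below is a minimal illustration with random, untrained weights, not the actual LightBagel implementation; the function names, dimensions, and two-layer loop are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multimodal_self_attention(tokens, rng):
    # Single-head self-attention with random (untrained) projections;
    # in the real model these would be the newly trained fusion weights.
    d = tokens.shape[-1]
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))
    return tokens + attn @ v  # residual connection

def frozen_block(tokens):
    # Stand-in for an original (retained) transformer block of a base model.
    return tokens

rng = np.random.default_rng(0)
d = 64
semantic = rng.standard_normal((16, d))  # tokens from the understanding branch
spatial = rng.standard_normal((64, d))   # tokens from the generation branch

# Double fusion: run each branch's original blocks, then a shared
# multimodal self-attention block that mixes the two token streams.
for _ in range(2):  # two illustrative layers
    semantic = frozen_block(semantic)  # original understanding block
    spatial = frozen_block(spatial)    # original generation block
    fused = multimodal_self_attention(
        np.concatenate([semantic, spatial], axis=0), rng)
    semantic, spatial = fused[:16], fused[16:]

print(semantic.shape, spatial.shape)  # each stream keeps its own shape
```

Because the fusion happens in added blocks rather than by modifying the base models' weights, each branch retains its original behavior while the new blocks learn to exchange high-level semantic and low-level spatial information.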
Why it matters?
This research is important because it makes it much more feasible for researchers and developers to create powerful multimodal AI models without needing massive resources. By releasing their code and models, they hope to encourage further innovation in this field and allow more people to build on their work, leading to more accessible and advanced AI technologies.
Abstract
Unified multimodal models have recently shown remarkable gains in both capability and versatility, yet most leading systems are still trained from scratch and require substantial computational resources. In this paper, we show that competitive performance can be obtained far more efficiently by strategically fusing publicly available models specialized for either generation or understanding. Our key design is to retain the original blocks while additionally interleaving multimodal self-attention blocks throughout the networks. This double fusion mechanism (1) effectively enables rich multimodal fusion while largely preserving the original strengths of the base models, and (2) catalyzes synergistic fusion of high-level semantic representations from the understanding encoder with low-level spatial signals from the generation encoder. By training with only ~35B tokens, this approach achieves strong results across multiple benchmarks: 0.91 on GenEval for compositional text-to-image generation, 82.16 on DPG-Bench for complex text-to-image generation, 6.06 on GEditBench, and 3.77 on ImgEdit-Bench for image editing. By fully releasing the entire suite of code, model weights, and datasets, we hope to support future research on unified multimodal modeling.