EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture
Xin He, Longhui Wei, Jianbo Ouyang, Lingxi Xie, Qi Tian
2025-12-08
Summary
This paper introduces EMMA, a unified system designed to handle multiple kinds of tasks involving both images and text: understanding what's in a picture, generating images from text, and editing existing images.
What's the problem?
Existing systems that try to do all these things – understand, generate, and edit images and text – often struggle with efficiency and performance. They either require a huge amount of computing power or fall short of specialized systems that focus on a single task. A key issue is the amount of information needed to represent images, which is far larger than for text and creates an imbalance between tasks during training.
What's the solution?
The researchers tackled this by building EMMA around several key ideas. First, they use an aggressive form of image compression (a 32x autoencoder) to drastically shrink the image representation, making it easier to handle alongside text and keeping the two modalities balanced during training. Second, instead of appending image tokens to the sequence one by one, they stack the understanding and generation features along the channel dimension, which keeps the overall sequence short while preserving the important features of the images. Third, they designed a network that shares most of its parameters across tasks – so the tasks can learn from each other – while keeping small task-specific parts for what each job needs. Finally, they added a mixture-of-experts component to the visual understanding encoder, which improves how well the system perceives image details without adding much extra cost.
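To make the first two ideas concrete, here is a minimal sketch of the token-budget arithmetic involved. The 32x compression ratio comes from the paper; the image resolution, the 8x baseline, and the function names are illustrative assumptions, not details of EMMA itself.

```python
def tokens_per_image(image_size: int, compression_ratio: int) -> int:
    """Visual tokens left after an autoencoder that downsamples the
    spatial resolution by `compression_ratio` in each dimension."""
    side = image_size // compression_ratio
    return side * side

# A 1024x1024 image with a typical 8x autoencoder vs. a 32x one:
tokens_8x = tokens_per_image(1024, 8)    # 128 * 128 = 16384 tokens
tokens_32x = tokens_per_image(1024, 32)  # 32 * 32   = 1024 tokens

def sequence_length(n_und: int, n_gen: int, mode: str) -> int:
    """Token-wise concatenation stacks understanding and generation
    tokens along the sequence axis; channel-wise concatenation merges
    them along the feature axis, so the sequence does not grow."""
    if mode == "token-wise":
        return n_und + n_gen
    if mode == "channel-wise":
        assert n_und == n_gen  # the two streams must be spatially aligned
        return n_und
    raise ValueError(mode)
```

Under these assumptions, channel-wise concatenation halves the visual sequence length relative to token-wise concatenation, on top of the 16x token reduction the stronger compression already provides.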
Why it matters?
EMMA is important because it shows a way to build a single system that can perform multiple image and text tasks effectively and efficiently. It outperforms other similar systems, even those that are much larger, and achieves results comparable to specialized systems. This work provides a strong foundation for future development in creating more versatile and powerful multimodal AI.
Abstract
We propose EMMA, an efficient and unified architecture for multimodal understanding, generation, and editing. Specifically, EMMA primarily consists of: 1) an efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation and, by applying the same compression ratio to images across tasks, keeps training balanced between understanding and generation; 2) channel-wise rather than token-wise concatenation of visual understanding and generation tokens, which further reduces the number of visual tokens in the unified architecture; 3) a shared-and-decoupled network that enables mutual improvement across tasks while meeting task-specific modeling requirements; and 4) a mixture-of-experts mechanism in the visual understanding encoder, which substantially improves perceptual capability with only a small increase in parameters. Extensive experiments show that EMMA-4B significantly outperforms state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, while achieving competitive results against recent specialized understanding and generation models (e.g., Qwen3-VL and Qwen-Image). We believe EMMA lays a solid foundation for the future development of unified multimodal architectures.
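The shared-and-decoupled idea in point 3 can be sketched in a few lines: one trunk whose parameters are updated by every task, plus small per-task branches. This is a structural illustration only – the class and the toy functions are hypothetical, not EMMA's actual implementation.

```python
class SharedAndDecoupled:
    """Toy layout: a shared trunk feeds task-specific heads."""

    def __init__(self, shared, task_heads):
        self.shared = shared          # parameters trained by all tasks
        self.task_heads = task_heads  # task-specific parameters

    def forward(self, x, task):
        h = self.shared(x)                # shared features: cross-task transfer
        return self.task_heads[task](h)   # decoupled head: task-specific modeling

# Stand-in "layers" so the sketch runs end to end:
model = SharedAndDecoupled(
    shared=lambda x: x + 1,
    task_heads={"understand": lambda h: h * 2, "generate": lambda h: h * 3},
)
```

Gradients from every task flow through the shared trunk (the "mutual improvement"), while each head is only touched by its own task, which is what lets a single model stay competitive with specialists.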