Architecture Decoupling Is Not All You Need For Unified Multimodal Model
Dian Zheng, Manyuan Zhang, Hongyu Li, Kai Zou, Hongbo Liu, Ziyu Guo, Kaituo Feng, Yexin Liu, Ying Luo, Yan Feng, Peng Pei, Xunliang Cai, Hongsheng Li
2025-12-01
Summary
This research focuses on building AI models that can both understand images and create new ones, moving closer to the goal of artificial general intelligence (AGI).
What's the problem?
A major hurdle in creating these 'unified' models is that the skills needed for understanding images are different from those needed for generating them. Training a single model to do both well often leads to conflicts, where improving one skill hurts the other. Some researchers try to fix this by building models with separate parts for understanding and generating, but this can make the model lose its ability to seamlessly combine these skills – the whole point of a unified model.
What's the solution?
The researchers investigated *why* separating the model's functions helps reduce these conflicts. They found that separation forces the model to develop distinct ways of paying attention to different parts of an image depending on whether it's trying to understand it or create something new. Based on this, they created a new training technique called Attention Interaction Alignment (AIA). AIA doesn't separate the model; instead, it encourages the model to *learn* those task-specific attention patterns directly, improving both understanding and generation without needing a complicated, divided structure. They tested this on existing models (Emu3 and Janus-Pro) and saw improvements in both understanding and generation.
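To make the idea of "aligning attention to a task-specific pattern" concrete, here is a minimal, hypothetical sketch. The paper summary does not give the actual AIA formula, so this assumes the loss penalizes the divergence between a model's cross-modal attention map and a target interaction pattern, using a row-wise KL divergence; the function names (`aia_loss`, `kl_row`) and the KL choice are illustrative assumptions, not the authors' implementation.

```python
import math

def kl_row(p, q, eps=1e-8):
    """KL divergence between two attention rows (probability vectors).
    eps keeps the log well-defined when an entry is zero."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def aia_loss(attn, target):
    """Mean row-wise KL between a model's cross-modal attention map and a
    task-specific target interaction pattern (both row-stochastic).
    Assumed form of an attention-alignment loss; not the paper's exact AIA."""
    return sum(kl_row(p, q) for p, q in zip(attn, target)) / len(attn)

# Toy 2x3 attention maps: each row sums to 1 (one query attending over 3 keys).
attn   = [[0.2, 0.3, 0.5],
          [0.6, 0.2, 0.2]]
target = [[0.1, 0.4, 0.5],
          [0.7, 0.2, 0.1]]

loss = aia_loss(attn, target)   # positive when the maps differ, 0 when identical
```

In a real training loop this term would be added to the usual understanding/generation objectives, nudging the shared model's attention toward the pattern that decoupled architectures arrive at implicitly.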
Why it matters?
This work is important because it offers a way to build powerful, unified AI models that can both understand and generate images *without* sacrificing the benefits of having a single, integrated system. This is a step forward in creating more versatile and capable AI, bringing us closer to AGI.
Abstract
Unified multimodal models for image generation and understanding represent a significant step toward AGI and have attracted widespread attention from researchers. The main challenge of this task lies in the difficulty of establishing an optimal training paradigm, due to inherently conflicting objectives in understanding and generation tasks. To alleviate these conflicts and pursue higher performance, many researchers adopt varying degrees of model decoupling (e.g., double image encoders, MoE/MoT architectures, or a frozen MLLM). However, excessive model decoupling can lead to the loss of interleaved generation ability, undermining the original intent of unified models. In this work, we aim to explore how to mitigate task conflicts without resorting to model decoupling. First, we analyze why decoupling alleviates conflicts by studying the cross-modal attention behavior of models. We observe that model decoupling essentially drives models toward task-specific multimodal interaction patterns, as seen in Qwen-VL and HunyuanImage, and that the more thorough the decoupling, the more consistent the behavior becomes. Motivated by this observation, we propose the Attention Interaction Alignment (AIA) loss, which explicitly learns task-specific multimodal interaction patterns during training. To demonstrate the generalizability of our AIA loss, we apply it to Emu3 and Janus-Pro during the SFT and post-training stages, respectively. Without bells and whistles, AIA not only refines cross-modal attention patterns but also boosts both generation and understanding performance.