ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model with Interleaved Multimodal Generation via Asymmetric Synergy
Jianwen Sun, Yukang Feng, Chuanhao Li, Fanrui Zhang, Zizhen Li, Jiaxin Ai, Sizhuo Zhou, Yu Dai, Shenglin Zhang, Kaipeng Zhang
2025-03-17
Summary
This paper introduces ARMOR, a new and efficient way to upgrade existing AI models so they can both understand and create images and text, particularly when the two are interleaved, like a picture with captions.
What's the problem?
Existing AI models that try to both understand and generate content (like images and text) at the same time require a lot of computing power, and they often struggle to generate text and images that are interwoven.
What's the solution?
ARMOR takes existing AI models that are already good at understanding language and images and adds new components that let them generate text and images together in a natural way, without requiring many extra resources. It's like giving an old model a new set of tools.
Why it matters?
This work matters because it makes it easier to create AI models that can both understand and generate complex content, which could be useful for things like creating educational materials or interactive stories.
Abstract
Unified models (UniMs) for multimodal understanding and generation have recently received much attention in the area of vision and language. Existing UniMs are designed to learn both multimodal understanding and generation capabilities simultaneously, which demands substantial computational resources, and they often struggle to generate interleaved text-image content. We present ARMOR, a resource-efficient and purely autoregressive framework that achieves both understanding and generation by fine-tuning existing multimodal large language models (MLLMs). Specifically, ARMOR extends existing MLLMs from three perspectives: (1) For the model architecture, an asymmetric encoder-decoder architecture with a forward-switching mechanism is introduced to unify the embedding space of textual and visual modalities, enabling natural text-image interleaved generation with minimal computational overhead. (2) For the training data, a meticulously curated, high-quality interleaved dataset is collected for fine-tuning MLLMs. (3) For the training algorithm, we propose a "what or how to generate" algorithm that empowers existing MLLMs with multimodal generation capabilities while preserving their multimodal understanding capabilities, through three progressive training stages on the collected dataset. Experimental results demonstrate that ARMOR upgrades existing MLLMs to UniMs with promising image generation capabilities, using limited training resources. Our code will be released soon at https://armor.github.io.
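To make the "what or how to generate" idea concrete, the toy sketch below illustrates one plausible reading of the abstract: at each decoding step a switch first decides *what* modality to emit (text or image), and a modality-specific head then decides *how* (which token). This is not the paper's released code; all names, sizes, and parameters here (`W_switch`, `W_text`, `W_image`, the toy vocabularies) are hypothetical stand-ins for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, purely illustrative -- not the dimensions used by ARMOR.
HIDDEN, TEXT_VOCAB, IMG_VOCAB = 8, 20, 12

# Hypothetical parameters standing in for a fine-tuned MLLM's output heads.
W_switch = rng.normal(size=(HIDDEN, 2))           # "what": text vs. image
W_text = rng.normal(size=(HIDDEN, TEXT_VOCAB))    # "how": text-token head
W_image = rng.normal(size=(HIDDEN, IMG_VOCAB))    # "how": visual-token head

def generate_step(h):
    """One decoding step: first pick the modality ("what"),
    then pick a token with the matching head ("how")."""
    mode = int(np.argmax(h @ W_switch))           # 0 = text, 1 = image
    head = W_text if mode == 0 else W_image
    token = int(np.argmax(h @ head))
    return mode, token

# A stand-in hidden state from the shared autoregressive backbone.
h = rng.normal(size=HIDDEN)
mode, token = generate_step(h)
print("mode:", mode, "token:", token)
```

Because the same backbone state feeds both decisions, interleaved output falls out naturally: runs of text tokens and runs of visual tokens alternate whenever the switch flips, with only the small switch and the extra head added on top of the existing model.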