
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation

Ethan Chern, Jiadi Su, Yan Ma, Pengfei Liu

2024-07-09

Summary

This paper introduces Anole, an open-source model designed to generate text and images together in a single, unified way, overcoming limitations faced by previous models.

What's the problem?

The main problem is that many existing multimodal models struggle to integrate text and image generation. They often require additional components (adapters) to connect visual information with a pre-trained language model, which complicates the pipeline. Additionally, some models can only generate one type of output at a time (single-modal), and those that do support both text and images often rely on separate diffusion models for the image side, making the overall system less unified and harder to work with.

What's the solution?

To solve these issues, the authors developed Anole, an autoregressive model that generates text and images as a single unified token stream, without needing separate components for each modality. This lets it produce coherent outputs that interleave images and text naturally. Anole is built on Meta AI's Chameleon model and uses a fine-tuning strategy that is both data-efficient and parameter-efficient: only a small fraction of the model's parameters are updated, on a relatively small instruction-tuning dataset, while still achieving high-quality results. The authors have released the model, training framework, and instruction-tuning data for others to use.
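
To make the "parameter-efficient" part concrete, here is a minimal PyTorch sketch of the general pattern: freeze the entire backbone and update only the output-head rows that score image tokens. This is an illustration of the idea described above, not the authors' released code; the toy model, vocabulary split, and token-id ranges are hypothetical placeholders.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 1024                          # hypothetical: shared text + image-code vocabulary
IMAGE_TOKEN_IDS = torch.arange(768, 1024)  # hypothetical ids reserved for discrete image codes
DIM = 256

class ToyUnifiedLM(nn.Module):
    """Tiny causal LM standing in for a Chameleon-style unified token model."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(DIM, VOCAB_SIZE, bias=False)

    def forward(self, ids):
        causal = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        h = self.backbone(self.embed(ids), mask=causal)
        return self.lm_head(h)

model = ToyUnifiedLM()

# Freeze everything, then re-enable gradients only for the output head.
for p in model.parameters():
    p.requires_grad = False
model.lm_head.weight.requires_grad = True

# Zero the gradient rows for non-image tokens, so only the logits that
# score image tokens are actually updated during fine-tuning.
row_mask = torch.zeros(VOCAB_SIZE, 1)
row_mask[IMAGE_TOKEN_IDS] = 1.0
model.lm_head.weight.register_hook(lambda g: g * row_mask)

# One dummy training step on a fake interleaved token sequence.
optimizer = torch.optim.AdamW([model.lm_head.weight], lr=1e-4)
ids = torch.randint(0, VOCAB_SIZE, (2, 16))
logits = model(ids[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB_SIZE), ids[:, 1:].reshape(-1)
)
loss.backward()
optimizer.step()
```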

Why it matters?

This research is important because it simplifies the process of generating multimodal content (text and images together), making it easier for developers and researchers to create applications that require both types of data. By improving how these models work together, Anole could enhance various fields such as content creation, advertising, and education, where combining visuals with text is essential.

Abstract

Previous open-source large multimodal models (LMMs) have faced several limitations: (1) they often lack native integration, requiring adapters to align visual representations with pre-trained large language models (LLMs); (2) many are restricted to single-modal generation; (3) while some support multimodal generation, they rely on separate diffusion models for visual modeling and generation. To mitigate these limitations, we present Anole, an open, autoregressive, native large multimodal model for interleaved image-text generation. We build Anole from Meta AI's Chameleon, adopting an innovative fine-tuning strategy that is both data-efficient and parameter-efficient. Anole demonstrates high-quality, coherent multimodal generation capabilities. We have open-sourced our model, training framework, and instruction tuning data.
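
As a rough illustration of what "interleaved image-text generation" means at the token level, the sketch below shows how a single autoregressive token stream can be routed into alternating text and image segments, with image spans delimited by special tokens and their discrete codes handed off to an image decoder. The marker ids and helper function are hypothetical stand-ins, not Chameleon's or Anole's actual tokenizer conventions.

```python
from typing import List, Tuple

BOI, EOI = -1, -2   # hypothetical "begin image" / "end image" marker ids

def split_interleaved(stream: List[int]) -> List[Tuple[str, List[int]]]:
    """Route a mixed token stream into ordered text and image segments."""
    segments, buffer, mode = [], [], "text"
    for tok in stream:
        if tok == BOI:                   # start collecting image codes
            if buffer:
                segments.append((mode, buffer))
            buffer, mode = [], "image"
        elif tok == EOI:                 # image block finished
            segments.append((mode, buffer))
            buffer, mode = [], "text"
        else:
            buffer.append(tok)
    if buffer:
        segments.append((mode, buffer))
    return segments

# Example: text tokens, then a block of discrete image codes, then more text.
stream = [5, 9, 11, BOI, 301, 302, 303, 304, EOI, 7, 8]
for kind, toks in split_interleaved(stream):
    print(kind, toks)   # "image" segments would go to a VQ decoder for pixels
```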