
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

Roman Bachmann, Oğuzhan Fatih Kar, David Mizrahi, Ali Garjani, Mingfei Gao, David Griffiths, Jiaming Hu, Afshin Dehghan, Amir Zamir

2024-06-14

Summary

This paper introduces 4M-21, an any-to-any vision model trained on tens of tasks and modalities, ranging from images and text to depth, segmentation, and human poses. Its goal is to let a single model understand and generate many different kinds of data at once.

What's the problem?

Current models that work with multiple types of data (like images and text) are usually trained on only a small number of tasks and input types. Because of this, they struggle when given new or diverse inputs out of the box, which limits their usefulness in real-world applications.

What's the solution?

To solve this problem, the authors trained the 4M-21 model on a much larger and more diverse set of modalities, including semantic and geometric information, feature maps from models like DINOv2 and ImageBind, pseudo labels from specialist models like SAM and 4DHumans, and new modalities such as image metadata and color palettes. A key technique is discrete tokenization: every modality, whether it looks like an image, a feature map, structured data, or text, is converted into sequences of discrete tokens, so one model can be trained on all of them in the same way. This approach lets 4M-21 solve at least three times more tasks and modalities than previous models without losing performance.
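To make discrete tokenization concrete, below is a minimal, illustrative sketch of vector-quantization-style tokenization in Python (PyTorch). The codebook size, feature dimension, and the quantize function are hypothetical and stand in for the authors' modality-specific tokenizers; the sketch only shows how continuous features (image patches, feature maps, pose vectors) can be mapped to discrete token ids.

    # Illustrative only: a nearest-neighbour vector-quantization step, the core idea
    # behind turning continuous modalities into discrete tokens. The real 4M-21
    # tokenizers are separate, trained models per modality and are not shown here.
    import torch

    def quantize(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
        """Map continuous feature vectors to the index of their nearest codebook entry.

        features: (N, D) feature vectors, e.g. flattened image patches or pose parameters.
        codebook: (K, D) set of K discrete code vectors (assumed already learned).
        returns:  (N,) integer token ids in [0, K).
        """
        distances = torch.cdist(features, codebook, p=2)  # Euclidean distance to every code
        return distances.argmin(dim=1)

    # Toy usage: tokenize a 16x16 grid of 8-dimensional features with a 1024-entry codebook.
    codebook = torch.randn(1024, 8)
    features = torch.randn(16 * 16, 8)
    token_ids = quantize(features, codebook)
    print(token_ids.shape)  # torch.Size([256]) -- one discrete token per grid cell

Once every modality is reduced to token ids over a finite vocabulary, inputs as different as depth maps, captions, and color palettes can all be consumed and predicted by the same token-based model.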

Why it matters?

This research is important because it sets a new standard for how multimodal models can be developed. By allowing a single model to handle many different tasks and types of data, 4M-21 has the potential to improve applications in areas like image generation, data retrieval, and interactive AI systems. The open-source nature of the model also encourages further research and development in the field.

Abstract

Current multimodal and multitask foundation models like 4M or Unified-IO show promising results, but in practice their out-of-the-box abilities to accept diverse inputs and perform diverse tasks are limited by the (usually rather small) number of modalities and tasks they are trained on. In this paper, we expand upon their capabilities by training a single model on tens of highly diverse modalities and by performing co-training on large-scale multimodal datasets and text corpora. This includes training on several semantic and geometric modalities, feature maps from recent state-of-the-art models like DINOv2 and ImageBind, pseudo labels of specialist models like SAM and 4DHumans, and a range of new modalities that allow for novel ways to interact with the model and steer the generation, for example image metadata or color palettes. A crucial step in this process is performing discrete tokenization on various modalities, whether they are image-like, neural network feature maps, vectors, structured data like instance segmentation or human poses, or data that can be represented as text. Through this, we expand on the out-of-the-box capabilities of multimodal models and specifically show the possibility of training one model to solve at least 3x more tasks/modalities than existing ones and doing so without a loss in performance. This enables more fine-grained and controllable multimodal generation capabilities and allows us to study the distillation of models trained on diverse data and objectives into a unified model. We successfully scale the training to a three-billion-parameter model using tens of modalities and different datasets. The resulting models and training code are open sourced at 4m.epfl.ch.
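As a rough illustration of what "any-to-any" prediction over tokenized modalities can look like, the sketch below feeds tokens from two input modalities into a small transformer and predicts the masked tokens of a third, target modality. All sizes, the modality choices, and the plain encoder architecture are assumptions made for illustration; the released 4M-21 models use their own masked multimodal training recipe, tokenizers, and architecture (see 4m.epfl.ch).

    # Illustrative data flow for any-to-any prediction over discrete tokens.
    # The model here is untrained, so the predictions are meaningless; only the
    # shapes and the masking scheme are the point of the example.
    import torch
    import torch.nn as nn

    VOCAB, DIM, N_MODALITIES = 1024, 64, 3           # hypothetical sizes
    MASK_ID = VOCAB                                   # extra id used for masked positions

    token_embed = nn.Embedding(VOCAB + 1, DIM)        # +1 slot for the mask id
    modality_embed = nn.Embedding(N_MODALITIES, DIM)  # marks which modality each token belongs to
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True),
        num_layers=2,
    )
    to_logits = nn.Linear(DIM, VOCAB)

    rgb_tokens = torch.randint(0, VOCAB, (1, 16))     # input modality 0, e.g. tokenized RGB
    text_tokens = torch.randint(0, VOCAB, (1, 8))     # input modality 1, e.g. tokenized caption
    depth_tokens = torch.full((1, 16), MASK_ID)       # target modality 2: every position masked

    tokens = torch.cat([rgb_tokens, text_tokens, depth_tokens], dim=1)
    mod_ids = torch.cat([
        torch.zeros(1, 16, dtype=torch.long),
        torch.ones(1, 8, dtype=torch.long),
        torch.full((1, 16), 2, dtype=torch.long),
    ], dim=1)

    hidden = encoder(token_embed(tokens) + modality_embed(mod_ids))
    depth_logits = to_logits(hidden[:, -16:])         # predictions for the masked depth positions
    predicted = depth_logits.argmax(dim=-1)           # token ids a depth tokenizer would decode back to a map
    print(predicted.shape)                            # torch.Size([1, 16])

Swapping which modalities are given and which are masked is what turns one token-based model into an any-to-any system in this setup.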