Beyond Language Modeling: An Exploration of Multimodal Pretraining

Shengbang Tong, David Fan, John Nguyen, Ellis Brown, Gaoyue Zhou, Shengyi Qian, Boyang Zheng, Théophane Vallaeys, Junlin Han, Rob Fergus, Naila Murray, Marjan Ghazvininejad, Mike Lewis, Nicolas Ballas, Amir Bar, Michael Rabbat, Jakob Verbeek, Luke Zettlemoyer, Koustuv Sinha, Yann LeCun, Saining Xie

2026-03-04

Summary

This research explores how to best build artificial intelligence systems that can understand both images and text at the same time, moving beyond AI that only focuses on language.

What's the problem?

Currently, it's unclear how best to design AI models that combine visual and language information. Many existing multimodal models are built on top of pretrained language models, which makes it hard to tell what is actually driving their visual understanding. The core issue is figuring out how to train these 'multimodal' models from scratch, without prior language training influencing the results, and how much data each type of information (images vs. text) needs.

What's the solution?

The researchers used an AI framework called Transfusion, which trains a single model with next-token prediction for text and diffusion for images, on a huge variety of data: text, videos, image-text pairs, and action-conditioned videos. They experimented with different techniques, including a Representation Autoencoder (RAE) for creating a unified visual representation and a Mixture-of-Experts (MoE) architecture that routes different types of information to specialized parts of the model. Using IsoFLOP analysis, they also measured how much computing power and data each modality needs, discovering that vision requires significantly more data than language to learn effectively. The MoE architecture helped balance this asymmetry.
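The Transfusion-style objective described above can be sketched in a few lines: language tokens are scored with a next-token cross-entropy loss, image latents with a diffusion-style noise-prediction loss, and the two are summed into one training objective. This is a minimal NumPy illustration, not the paper's implementation; the loss weighting `lam` and all array shapes here are assumptions for the sake of the example.

```python
import numpy as np

def cross_entropy(logits, targets):
    # Next-token prediction loss for text: mean negative log-likelihood
    # of the target token under a softmax over the vocabulary.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def diffusion_mse(pred_noise, true_noise):
    # Denoising objective for vision: the model predicts the noise that
    # was added to an image latent, scored with mean squared error.
    return ((pred_noise - true_noise) ** 2).mean()

def transfusion_loss(text_logits, text_targets, pred_noise, true_noise, lam=1.0):
    # Single combined objective: language via next-token prediction,
    # vision via diffusion, as in the Transfusion framework. The
    # balancing weight `lam` is illustrative, not from the paper.
    return cross_entropy(text_logits, text_targets) + lam * diffusion_mse(pred_noise, true_noise)
```

In a real model both terms are computed by one shared transformer over an interleaved sequence of text tokens and image latents; here they are kept separate only to show how the two losses combine.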

Why it matters?

This work provides valuable insights into building more powerful and versatile AI. By understanding how to effectively combine vision and language, and by recognizing the different data needs of each, we can create AI systems that better understand the world around us, leading to advancements in areas like robotics, image understanding, and more human-like AI interactions.

Abstract

The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models.
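The IsoFLOP analysis mentioned in the abstract works by training models of different sizes at a fixed compute budget and locating the model size that minimizes loss; repeating this across budgets traces out a scaling law per modality. A minimal sketch of the per-budget step, under the standard assumption that loss is roughly parabolic in log model size (the synthetic numbers below are illustrative, not the paper's measurements):

```python
import numpy as np

def isoflop_optimum(model_sizes, losses):
    # Fit a parabola to loss vs. log(model size) for runs at one fixed
    # FLOP budget, and return the model size at the fitted minimum.
    logN = np.log(model_sizes)
    a, b, _ = np.polyfit(logN, losses, 2)
    return np.exp(-b / (2 * a))

# Illustrative synthetic IsoFLOP curve with its minimum at 1e9 params.
sizes = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
losses = 0.1 * (np.log(sizes) - np.log(1e9)) ** 2 + 2.0
best_n = isoflop_optimum(sizes, losses)
```

Fitting `isoflop_optimum` across several budgets, separately for the vision and language loss terms, is what lets the authors compare how fast the optimal model size grows with compute for each modality and expose the scaling asymmetry.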