Emu3.5: Native Multimodal Models are World Learners

Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, Yueze Wang, Chengyuan Wang, Fan Zhang, Yingli Zhao, Ting Pan, Xianduo Li, Zecheng Hao, Wenxuan Ma, Zhuo Chen, Yulong Ao, Tiejun Huang, Zhongyuan Wang

2025-10-31

Summary

This paper introduces Emu3.5, a new artificial intelligence model that can understand and generate both images and text together, essentially predicting what happens next in a visual and linguistic world.

What's the problem?

Existing AI models often struggle to seamlessly connect what they 'see' (images) with what they 'read' (text) and then predict what will happen next in a realistic way. Generating detailed, consistent, and complex images from text prompts, or predicting future frames in a video, remains difficult for current systems. On top of that, generating images token by token is slow.

What's the solution?

The researchers created Emu3.5 by training it on a massive amount of interleaved video and text data – over 10 trillion tokens. This training teaches it to predict the next 'token' (a small piece of text or image data) in a sequence, using the same objective for both modalities. They also developed a technique called Discrete Diffusion Adaptation (DiDA) to speed up image generation, making per-image inference about 20 times faster without losing quality. Finally, they refined the model with large-scale reinforcement learning to improve its reasoning and generation abilities.
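The key idea behind "predicting the next token in a sequence" is that text and image tokens are flattened into one stream and trained with a single objective. Here is a minimal sketch of that setup; the token IDs, the BOI/EOI marker values, and the helper functions are illustrative assumptions, not Emu3.5's real vocabulary or code.

```python
# Hypothetical special tokens marking where an image-token span begins/ends.
BOI, EOI = 100000, 100001

def interleave(text_tokens, image_tokens):
    """Build one interleaved sequence: text tokens, then an image-token span."""
    return text_tokens + [BOI] + image_tokens + [EOI]

def next_token_pairs(sequence):
    """Training pairs for next-token prediction: (context, target).
    The same objective covers text and image tokens alike."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]

# Toy example: three text tokens followed by three image tokens.
seq = interleave([5, 17, 42], [2048, 731, 99])
pairs = next_token_pairs(seq)
# Every position in the stream, text or image, becomes a prediction target.
```

The point of the sketch is that no separate "image model" and "text model" are needed at training time: one autoregressive objective sees the whole interleaved stream.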

Why it matters?

Emu3.5 is important because it represents a significant step forward in AI's ability to understand and interact with the world in a multimodal way – processing and generating both images and text effectively. It performs as well as, or better than, other leading models such as Gemini 2.5 Flash Image (Nano Banana) on image generation and editing. The researchers are also releasing the model publicly so other scientists can build on their work, potentially advancing areas like robotics, virtual reality, and content creation.

Abstract

We introduce Emu3.5, a large-scale multimodal world model that natively predicts the next state across vision and language. Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on a corpus of vision-language interleaved data containing over 10 trillion tokens, primarily derived from sequential frames and transcripts of internet videos. The model naturally accepts interleaved vision-language inputs and generates interleaved vision-language outputs. Emu3.5 is further post-trained with large-scale reinforcement learning to enhance multimodal reasoning and generation. To improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA), which converts token-by-token decoding into bidirectional parallel prediction, accelerating per-image inference by about 20x without sacrificing performance. Emu3.5 exhibits strong native multimodal capabilities, including long-horizon vision-language generation, any-to-image (X2I) generation, and complex text-rich image generation. It also exhibits generalizable world-modeling abilities, enabling spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios and tasks. For comparison, Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image (Nano Banana) on image generation and editing tasks and demonstrates superior results on a suite of interleaved generation tasks. We open-source Emu3.5 at https://github.com/baaivision/Emu3.5 to support community research.
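To see why converting token-by-token decoding into bidirectional parallel prediction helps, it is enough to count model forward passes. The toy sketch below contrasts the two decoding patterns; the stand-in "model" (random token choices), the token budget, and the step count are assumptions for illustration only, not Emu3.5's actual decoder.

```python
import random

def decode_autoregressive(n_tokens):
    """Token-by-token decoding: one forward pass per generated token."""
    passes, out = 0, []
    for _ in range(n_tokens):
        passes += 1
        out.append(random.randrange(1024))  # placeholder token prediction
    return out, passes

def decode_parallel(n_tokens, n_steps):
    """Diffusion-style decoding: each step predicts all positions at once,
    so the pass count depends only on the (small) number of refinement steps."""
    out, passes = [None] * n_tokens, 0
    for _ in range(n_steps):
        passes += 1
        out = [random.randrange(1024) for _ in out]  # refine every position
    return out, passes

_, ar_passes = decode_autoregressive(4096)          # one image's worth of tokens (assumed)
_, dida_passes = decode_parallel(4096, n_steps=16)  # step count chosen for illustration
fewer_passes = ar_passes / dida_passes
```

Note that each parallel pass is more expensive than an autoregressive one (it attends bidirectionally over all positions), so the reduction in forward passes is an upper bound on wall-clock speedup; the paper reports roughly 20x per-image acceleration in practice.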