
EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control

Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Xinyi Ye, Qi Lv, Modi Shi, Guanghui Ren, Cheng Ruan, Maoqing Yao, Haoran Yang, Jiacheng Bao, Bin Zhao, Dong Wang

2025-09-01

Summary

This paper introduces a new system called EO-Robotics, which aims to make robots better at understanding and interacting with the real world like humans do, by combining vision, language, and actions.

What's the problem?

Current robots, even those using advanced vision-language-action models, struggle with tasks that require them to constantly switch between understanding what they see, what they're told, and what actions to take. They lack the flexibility to handle complex, real-world situations that demand all three at once, and they fall well short of human-level generalization to new scenarios.

What's the solution?

The researchers developed EO-1, a powerful new 'foundation model' for robots. This model is unique because it can process images, text, videos, and robot actions all in the same way. They also created a huge dataset, EO-Data1.5M, with over 1.5 million examples specifically designed to train the robot to understand the connection between what it sees, what it is told, and what it does. EO-1 learns by predicting what comes next in sequences of these inputs, allowing it to generate appropriate actions for a given situation.
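To make "processing everything in the same way" concrete, here is a minimal sketch of a single transformer that consumes image features, text tokens, and action vectors as one interleaved sequence and predicts the next text token or action. This is not the authors' code: the module names, dimensions, vocabulary size, and 7-dimensional action format are illustrative assumptions.

```python
# Minimal sketch (PyTorch) of interleaved vision-text-action sequence modeling.
# All names, dimensions, and vocab sizes are assumptions, not the EO-1 architecture.
import torch
import torch.nn as nn

class InterleavedBackbone(nn.Module):
    """One transformer trunk that sees image, text, and action tokens in a single stream."""
    def __init__(self, d_model=256, n_heads=4, n_layers=2, text_vocab=1000):
        super().__init__()
        self.text_embed = nn.Embedding(text_vocab, d_model)   # discrete text tokens
        self.image_proj = nn.Linear(768, d_model)              # image patch features -> shared space
        self.action_proj = nn.Linear(7, d_model)               # e.g. 7-DoF action vectors (assumption)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, n_layers)
        self.text_head = nn.Linear(d_model, text_vocab)        # next-token prediction (reasoning)
        self.action_head = nn.Linear(d_model, 7)               # continuous action prediction

    def forward(self, image_feats, text_ids, actions):
        # Interleave the modalities into one token sequence: [image | text | action].
        tokens = torch.cat([
            self.image_proj(image_feats),
            self.text_embed(text_ids),
            self.action_proj(actions),
        ], dim=1)
        # Causal mask so each position only attends to earlier tokens ("predict what comes next").
        T = tokens.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.trunk(tokens, mask=mask)
        return self.text_head(h), self.action_head(h)

# Toy forward pass: 1 sample, 4 image patches, 6 text tokens, 3 action steps.
model = InterleavedBackbone()
text_logits, pred_actions = model(
    torch.randn(1, 4, 768), torch.randint(0, 1000, (1, 6)), torch.randn(1, 3, 7)
)
print(text_logits.shape, pred_actions.shape)
```

The point of the sketch is the shared token stream: because every modality lands in the same sequence, the same next-step prediction objective can cover reasoning (text) and control (actions).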

Why it matters?

This work is important because it represents a significant step towards creating robots that can truly understand and operate in the real world with human-like intelligence. By improving a robot's ability to reason about what it sees and is told, and then act accordingly, we can build robots capable of tackling more complex and useful tasks in everyday life.

Abstract

The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general-purpose embodied intelligent systems. Recent vision-language-action (VLA) models, which are co-trained on large-scale robot and visual-text data, have demonstrated notable progress in general robot control. However, they still fail to achieve human-level flexibility in interleaved reasoning and interaction. In this work, we introduce EO-Robotics, which consists of the EO-1 model and the EO-Data1.5M dataset. EO-1 is a unified embodied foundation model that achieves superior performance in multimodal embodied reasoning and robot control through interleaved vision-text-action pre-training. The development of EO-1 is based on two key pillars: (i) a unified architecture that processes multimodal inputs indiscriminately (image, text, video, and action), and (ii) a massive, high-quality multimodal embodied reasoning dataset, EO-Data1.5M, which contains over 1.5 million samples with emphasis on interleaved vision-text-action comprehension. EO-1 is trained through synergies between auto-regressive decoding and flow matching denoising on EO-Data1.5M, enabling seamless robot action generation and multimodal embodied reasoning. Extensive experiments demonstrate the effectiveness of interleaved vision-text-action learning for open-world understanding and generalization, validated through a variety of long-horizon, dexterous manipulation tasks across multiple embodiments. This paper details the architecture of EO-1, the data construction strategy of EO-Data1.5M, and the training methodology, offering valuable insights for developing advanced embodied foundation models.
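For readers curious how auto-regressive decoding and flow-matching denoising can coexist in one training objective, the sketch below pairs a standard next-token cross-entropy loss for text with a simple linear flow-matching loss for continuous actions. The equal loss weighting, the noise schedule, and the placeholder velocity network are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of the two training signals named in the abstract: autoregressive
# next-token prediction for text and flow-matching denoising for continuous actions.
# Loss weighting, noise schedule, and tensor shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def autoregressive_loss(text_logits, text_targets):
    # Standard next-token cross-entropy over the text portion of the sequence.
    return F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())

def flow_matching_loss(velocity_net, clean_actions):
    # Linear-interpolation flow matching: sample a time t, blend noise with data,
    # and regress the predicted velocity toward (data - noise).
    noise = torch.randn_like(clean_actions)
    t = torch.rand(clean_actions.size(0), 1, 1)            # one time per sample
    noisy = (1 - t) * noise + t * clean_actions            # point on the straight noise->data path
    target_velocity = clean_actions - noise                # constant velocity of that path
    pred_velocity = velocity_net(noisy, t)
    return F.mse_loss(pred_velocity, target_velocity)

# Toy usage with a stand-in velocity network (a real model would condition it on
# the shared transformer's hidden states).
velocity_net = lambda x, t: x                              # placeholder, assumption only
text_logits = torch.randn(2, 6, 1000)
text_targets = torch.randint(0, 1000, (2, 6))
actions = torch.randn(2, 3, 7)
total_loss = autoregressive_loss(text_logits, text_targets) + flow_matching_loss(velocity_net, actions)
print(total_loss.item())
```

The two terms operate on different parts of the same interleaved sequence: discrete text tokens get the autoregressive loss, while continuous action chunks get the denoising-style flow-matching loss, so one backbone can be optimized for both reasoning and control.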