MiMo-Embodied: X-Embodied Foundation Model Technical Report
Xiaoshuai Hao, Lei Zhou, Zhijian Huang, Zhiwen Hou, Yingbo Tang, Lingfeng Zhang, Guang Li, Zheng Lu, Shuhuai Ren, Xianhui Meng, Yuchen Zhang, Jing Wu, Jinghui Lu, Chenxu Dang, Jiayi Guan, Jianhua Wu, Zhiyi Hou, Hanbing Li, Shumeng Xia, Mingliang Zhou, Yinan Zheng, Zihao Yue
2025-11-21
Summary
This paper introduces MiMo-Embodied, a single AI model that performs strongly at two different but related jobs: driving cars autonomously and helping robots understand and interact with the physical world around them.
What's the problem?
Traditionally, AI models are built to excel at *one* specific task: a self-driving car model is developed separately from a model that helps a robot pick up objects. This duplicates effort and ignores the fact that understanding the physical world is useful for *both* driving and robotics. The challenge was to create a single model that excels at both by exploiting the overlap between the two fields.
What's the solution?
The researchers created MiMo-Embodied by first training it in stages on carefully curated data, then fine-tuning it with two techniques: chain-of-thought supervision and reinforcement learning. Essentially, they taught the model to 'think through' problems step by step and to learn from feedback on its answers. This staged approach let the model share knowledge between the driving and robotics tasks, improving performance in both areas.
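To make the recipe concrete, below is a minimal, purely illustrative sketch of what a staged curriculum followed by RL fine-tuning can look like. Every name in it (the stage list, the data mixtures, the reward function) is a hypothetical placeholder, not the authors' actual pipeline or hyperparameters.

```python
# Conceptual sketch only: stage names, data mixtures, and the reward are
# invented placeholders, not MiMo-Embodied's real training code.
from dataclasses import dataclass


@dataclass
class Stage:
    """One curriculum stage: a name plus the data mixtures it trains on."""
    name: str
    data_mixtures: list[str]


# Hypothetical curriculum: general vision-language grounding first, then the
# two embodied domains, then chain-of-thought traces for both.
CURRICULUM = [
    Stage("general_vl", ["captioning", "vqa"]),
    Stage("embodied_sft", ["task_planning", "affordance", "spatial"]),
    Stage("driving_sft", ["perception", "prediction", "planning"]),
    Stage("cot_sft", ["embodied_cot", "driving_cot"]),
]


def train_stage(model_state: dict, stage: Stage) -> dict:
    """Stand-in for supervised fine-tuning on one stage's data mixture."""
    print(f"[SFT] stage={stage.name} mixtures={stage.data_mixtures}")
    model_state["stages_seen"].append(stage.name)
    return model_state


def verifiable_reward(answer: str, reference: str) -> float:
    """Stand-in for a verifiable reward: exact match on the final answer."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def rl_finetune(model_state: dict, reward_fn) -> dict:
    """Stand-in for the final RL stage: sample answers, score them with the
    reward, and update the model toward higher-reward behavior."""
    score = reward_fn("turn left", "turn left")  # toy rollout
    print(f"[RL] sampled rollout reward={score}")
    model_state["rl_done"] = True
    return model_state


if __name__ == "__main__":
    state = {"stages_seen": [], "rl_done": False}
    for stage in CURRICULUM:  # supervised stages run in order
        state = train_stage(state, stage)
    state = rl_finetune(state, verifiable_reward)  # RL comes last
    print(state)
```

The design choice the sketch mirrors is the ordering described in the paper: supervised stages over mixed driving and robotics data come first, and reward-driven fine-tuning comes last, so the RL step refines a model that already shares representations across both domains.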
Why it matters?
This is a significant step forward because it shows that different areas of AI can benefit from each other: a single model trained for both driving and robotics can be more efficient and more capable than two separate specialists. The researchers are also releasing their model and code publicly, so other scientists can build on this work and accelerate progress in both fields.
Abstract
We open-source MiMo-Embodied, the first cross-embodied foundation model to successfully integrate and achieve state-of-the-art performance in both Autonomous Driving and Embodied AI. MiMo-Embodied sets new records across 17 embodied AI benchmarks in Task Planning, Affordance Prediction, and Spatial Understanding, while also excelling in 12 autonomous driving benchmarks across Environmental Perception, Status Prediction, and Driving Planning. Across these tasks, MiMo-Embodied significantly outperforms existing open-source, closed-source, and specialized baselines. Our results indicate that through multi-stage learning, curated data construction, and CoT/RL fine-tuning, these two domains exhibit strong positive transfer and reinforce one another. We provide a detailed analysis of our model design and training methodologies to facilitate further research. Code and models are available at https://github.com/XiaomiMiMo/MiMo-Embodied.
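Since the abstract points to released code and models, here is a hedged sketch of how one might load and query such a checkpoint, assuming a Hugging Face transformers-compatible release. The model identifier, prompt, and text-only input are assumptions (as a vision-language model, it likely also takes images via model-specific preprocessing); consult the repository above for the authors' actual instructions.

```python
# Hypothetical usage sketch, assuming a transformers-compatible checkpoint.
# The model ID below is assumed, not verified; see the linked repo for the
# authors' actual loading and inference instructions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "XiaomiMiMo/MiMo-Embodied"  # assumed identifier

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto",  # requires the accelerate package
)

prompt = "Describe the drivable area and any obstacles in the scene."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```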