Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation
Yifu Yuan, Haiqin Cui, Yaoting Huang, Yibin Chen, Fei Ni, Zibin Dong, Pengyi Li, Yan Zheng, Jianye Hao
2025-08-20

Summary
This paper introduces a new way for robots to understand and act in the world by using "pointing" as a middle step between seeing and doing. The authors built a vision-language model called Embodied-R1 and a large dataset, Embodied-Points-200K, to teach it how to point. The resulting model performs well on many robot tasks, including ones it has never seen before, and is much better than older methods at connecting what it sees to the right action.
What's the problem?
Robots struggle to learn new tasks because it is hard to connect what they see with the physical actions they need to perform. Training data is scarce, and different robots are built in different ways, which makes it difficult to create one general way for all of them to learn.
What's the solution?
The researchers propose "pointing" as a universal language for robots: the model looks at a scene, points to where something should happen, and that point is then translated into concrete movements. They built Embodied-R1, a 3B vision-language model, and trained it on a large collection of pointing-related data using a method that rewards accurate pointing, enabling it to perform a wide variety of tasks.
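To make the pointing-to-movement idea concrete, here is a minimal sketch, not the paper's actual pipeline, of how a 2D point predicted by a model like Embodied-R1 could be lifted into a 3D target a robot arm can move toward, assuming an aligned depth image and known camera intrinsics; the function and parameter values are illustrative.

```python
# Illustrative sketch only (not the paper's pipeline): back-project a predicted
# 2D point into a 3D target in the camera frame, assuming an aligned depth image
# and known pinhole camera intrinsics (fx, fy, cx, cy).
import numpy as np

def point_to_3d(u: int, v: int, depth: np.ndarray,
                fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Convert a pixel (u, v) and its depth reading into a 3D point (meters)."""
    z = float(depth[v, u])          # depth at the pointed pixel (row v, column u)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# Example with a synthetic depth map and typical intrinsics.
depth = np.full((480, 640), 0.6, dtype=np.float32)      # flat scene 0.6 m away
target_cam = point_to_3d(350, 150, depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(target_cam)  # a hand-eye calibration would then map this target into the
                   # robot's base frame for motion planning
```

Because the point itself is just a pixel coordinate, the same prediction can in principle be reused by different robots, which is why pointing can serve as an embodiment-agnostic intermediate representation.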
Why it matters?
This research matters because it offers a concrete way to close the gap between seeing and doing. By using pointing as a common language and training with a more effective, reward-based approach, the work substantially improves robots' ability to generalize their skills to new situations and real-world tasks, making them more capable and versatile.
Abstract
Generalization in embodied AI is hindered by the "seeing-to-doing gap," which stems from data scarcity and embodiment heterogeneity. To address this, we pioneer "pointing" as a unified, embodiment-agnostic intermediate representation, defining four core embodied pointing abilities that bridge high-level vision-language comprehension with low-level action primitives. We introduce Embodied-R1, a 3B Vision-Language Model (VLM) specifically designed for embodied reasoning and pointing. We use a wide range of embodied and general visual reasoning datasets as sources to construct a large-scale dataset, Embodied-Points-200K, which supports key embodied pointing capabilities. We then train Embodied-R1 using a two-stage Reinforced Fine-tuning (RFT) curriculum with a specialized multi-task reward design. Embodied-R1 achieves state-of-the-art performance on 11 embodied spatial and pointing benchmarks. Critically, it demonstrates robust zero-shot generalization by achieving a 56.2% success rate in SIMPLEREnv and 87.5% across 8 real-world XArm tasks without any task-specific fine-tuning, representing a 62% improvement over strong baselines. Furthermore, the model exhibits high robustness against diverse visual disturbances. Our work shows that a pointing-centric representation, combined with an RFT training paradigm, offers an effective and generalizable pathway to closing the perception-action gap in robotics.
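The abstract does not spell out the multi-task reward design, but a common recipe for R1-style reinforced fine-tuning on pointing tasks is to combine a format reward (is the output a well-formed point?) with a verifiable accuracy reward (does the predicted point land on the target object?). The sketch below illustrates that idea; the <point>x, y</point> output format, function names, and weights are assumptions for illustration, not the paper's exact design.

```python
# Minimal sketch of a pointing reward for reinforced fine-tuning (RFT).
# Assumed for illustration: the model emits "<point>x, y</point>" in pixel
# coordinates and the ground truth is a binary object mask; the paper's
# actual reward design may differ.
import re
import numpy as np

POINT_PATTERN = re.compile(r"<point>\s*(\d+)\s*,\s*(\d+)\s*</point>")

def format_reward(response: str) -> float:
    """1.0 if the response contains a well-formed <point>x, y</point> tag."""
    return 1.0 if POINT_PATTERN.search(response) else 0.0

def accuracy_reward(response: str, gt_mask: np.ndarray) -> float:
    """1.0 if the predicted pixel falls inside the ground-truth object mask."""
    match = POINT_PATTERN.search(response)
    if match is None:
        return 0.0
    x, y = int(match.group(1)), int(match.group(2))
    h, w = gt_mask.shape
    if not (0 <= x < w and 0 <= y < h):
        return 0.0
    return float(gt_mask[y, x] > 0)

def pointing_reward(response: str, gt_mask: np.ndarray,
                    w_format: float = 0.1, w_acc: float = 0.9) -> float:
    """Weighted sum of format and accuracy terms (weights are illustrative)."""
    return w_format * format_reward(response) + w_acc * accuracy_reward(response, gt_mask)

# Example: the target object occupies a rectangular region of a 480x640 mask.
mask = np.zeros((480, 640), dtype=np.uint8)
mask[100:200, 300:400] = 1
print(pointing_reward("Grasp it here: <point>350, 150</point>", mask))  # 1.0
```

In an RFT loop, a verifiable reward like this would score each sampled response directly from ground-truth annotations, so the policy can be updated without training a separate reward model.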