
R2RGEN: Real-to-Real 3D Data Generation for Spatially Generalized Manipulation

Xiuwei Xu, Angyuan Ma, Hankun Li, Bingyao Yu, Zheng Zhu, Jie Zhou, Jiwen Lu

2025-10-10


Summary

This paper focuses on making robots better at grasping and manipulating objects in the real world, even when those objects are arranged differently from anything the robot has seen before.

What's the problem?

Robots typically need many training examples to learn how to handle objects, and those examples have to cover every spatial arrangement they might encounter. Collecting that much real-world data is slow and labor-intensive. Previous attempts to generate extra examples in simulation often transfer poorly to the real world because of differences between simulation and reality, and they usually work only in constrained setups, such as fixed robot bases and predefined camera viewpoints.

What's the solution?

The researchers developed a new system called R2RGen that creates realistic training data directly from a single real-world demonstration. It parses that one successful attempt in detail, breaking down the scene and the robot's trajectory, and then uses that information to generate many similar but spatially varied scenarios. Importantly, it does not rely on simulation or rendering at all, so the generated data stays faithful to the real world. The system also processes the generated data so that it matches what a robot's 3D sensor, such as a depth camera, would actually 'see'.
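The paper does not include code here, but the core idea of transforming a point-cloud observation together with the matching actions can be sketched in a few lines. The sketch below is illustrative only: the function names, the planar-transform assumption, and the placeholder arrays are not the authors' implementation.

```python
import numpy as np

def random_se2_on_table(max_xy=0.15, max_yaw=np.pi):
    """Sample a random tabletop-plane rigid transform: rotation about z plus an x/y shift."""
    yaw = np.random.uniform(-max_yaw, max_yaw)
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:2, 3] = np.random.uniform(-max_xy, max_xy, size=2)
    return T

def transform_points(T, points):
    """Apply a 4x4 homogeneous transform to an (N, 3) array of points."""
    return points @ T[:3, :3].T + T[:3, 3]

def augment_demo(object_points, interaction_waypoints):
    """Create one spatially perturbed variant of a demonstration: move the
    object's point cloud and the end-effector waypoints that interact with it
    by the same rigid transform, so the action stays consistent with the new
    observation."""
    T = random_se2_on_table()
    return transform_points(T, object_points), transform_points(T, interaction_waypoints)

# Example: turn one recorded demonstration into 100 spatial variants.
object_points = np.random.rand(2048, 3)        # stand-in for a segmented object cloud
interaction_waypoints = np.random.rand(20, 3)  # stand-in for gripper positions near the object
variants = [augment_demo(object_points, interaction_waypoints) for _ in range(100)]
```

R2RGen's actual group-wise augmentation handles multi-object compositions and task constraints rather than a single object, but the basic move-the-observation-and-actions-together step it builds on looks like this.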

Why it matters?

This work is important because it makes it easier to train robots to handle objects in a wider variety of situations. Because the method requires less real-world data and avoids the pitfalls of simulation, robots can learn more efficiently and be deployed in more complex, unpredictable environments, like a home or a warehouse.

Abstract

Towards the aim of generalized robotic manipulation, spatial generalization is the most fundamental capability, requiring the policy to work robustly under different spatial distributions of objects, the environment, and the agent itself. To achieve this, substantial human demonstrations need to be collected to cover diverse spatial configurations for training a generalized visuomotor policy via imitation learning. Prior works explore a promising direction that leverages data generation to acquire abundant spatially diverse data from minimal source demonstrations. However, most approaches face a significant sim-to-real gap and are often limited to constrained settings, such as fixed-base scenarios and predefined camera viewpoints. In this paper, we propose a real-to-real 3D data generation framework (R2RGen) that directly augments the pointcloud observation-action pairs to generate real-world data. R2RGen is simulator- and rendering-free, making it efficient and plug-and-play. Specifically, given a single source demonstration, we introduce an annotation mechanism for fine-grained parsing of scene and trajectory. A group-wise augmentation strategy is proposed to handle complex multi-object compositions and diverse task constraints. We further present camera-aware processing to align the distribution of generated data with real-world 3D sensors. Empirically, R2RGen substantially enhances data efficiency in extensive experiments and demonstrates strong potential for scaling and for application to mobile manipulation.
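As a rough illustration of what camera-aware processing could involve, the sketch below filters an augmented point cloud down to the points a single depth camera could plausibly observe, using a simple z-buffer visibility test. The intrinsics, threshold, and function name are assumptions for illustration, not the paper's actual procedure.

```python
import numpy as np

def camera_visible_mask(points_cam, fx, fy, cx, cy, width, height):
    """Approximate single-view visibility: project points (already in camera
    coordinates, z pointing forward) onto the image plane and keep, per pixel,
    only the nearest point -- a crude z-buffer stand-in for what a real depth
    sensor would return."""
    z = points_cam[:, 2]
    u = np.round(fx * points_cam[:, 0] / np.maximum(z, 1e-6) + cx).astype(int)
    v = np.round(fy * points_cam[:, 1] / np.maximum(z, 1e-6) + cy).astype(int)
    in_frame = (z > 1e-6) & (u >= 0) & (u < width) & (v >= 0) & (v < height)

    depth = np.full((height, width), np.inf)
    idx = np.flatnonzero(in_frame)
    # Record the nearest depth that hits each pixel.
    np.minimum.at(depth, (v[idx], u[idx]), z[idx])
    # A point counts as visible if it is (nearly) the nearest point at its pixel.
    visible = np.zeros(len(points_cam), dtype=bool)
    visible[idx] = z[idx] <= depth[v[idx], u[idx]] + 5e-3
    return visible

# Example: keep only what a 640x480 depth camera would see of an augmented cloud.
points_cam = np.random.uniform([-0.5, -0.5, 0.3], [0.5, 0.5, 1.5], size=(4096, 3))
mask = camera_visible_mask(points_cam, fx=600, fy=600, cx=320, cy=240, width=640, height=480)
filtered = points_cam[mask]
```

Filtering like this keeps generated observations partial and view-dependent, in line with the paper's goal of matching the distribution of data captured by a real 3D sensor.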