Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface

Yujie Zhao, Hongwei Fan, Di Chen, Shengcong Chen, Liliang Chen, Xiaoqi Li, Guanghui Ren, Hao Dong

2025-12-23

Summary

This paper introduces a method called Real2Edit2Real for generating additional training data for robots learning to manipulate objects. It focuses on improving how well robots generalize to situations they haven't specifically been trained on, especially those requiring spatial understanding, such as objects appearing in new positions.

What's the problem?

Training robots to reliably perform tasks like grasping or moving objects requires a huge amount of example data. Getting this data is expensive and time-consuming, as someone needs to physically demonstrate the task many times. Current methods struggle when the robot needs to generalize to slightly different environments or object positions, meaning they need even *more* data to cover all possibilities. The biggest challenge is efficiently creating diverse training examples, particularly for tasks where the robot needs to understand where things are in 3D space.

What's the solution?

The researchers developed a system that uses 3D editing to create new training examples from a small number of real demonstrations. First, they reconstruct a 3D model of the scene from regular 2D images. Then, they virtually manipulate objects within this 3D model, creating new possible scenarios. Importantly, they make sure these virtual manipulations are physically realistic. Finally, they use this edited 3D information to generate realistic-looking videos of the robot performing the new manipulations, which are then used as training data. They use depth information as the main guide for creating these videos, along with other cues like robot actions and object edges.
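The core 3D editing step can be pictured as two operations: applying a rigid transform to an object's points within the reconstructed scene point cloud, then re-projecting the edited cloud into a depth map that conditions the video generator. The sketch below is a minimal illustration of this idea, not the authors' implementation; the function names, the simple z-buffer projection, and the pinhole intrinsics `K` are all assumptions for demonstration.

```python
import numpy as np

def edit_object_pose(points, object_mask, rotation, translation):
    """Rigidly move the masked object's points within the scene point cloud.

    points: (N, 3) array in the camera frame; object_mask: (N,) boolean;
    rotation: (3, 3); translation: (3,). Non-object points are untouched.
    """
    edited = points.copy()
    edited[object_mask] = points[object_mask] @ rotation.T + translation
    return edited

def render_depth(points, K, image_size):
    """Project 3D points through pinhole intrinsics K into a z-buffered depth map.

    Nearest point wins at each pixel; empty pixels are left at 0.
    """
    h, w = image_size
    depth = np.full((h, w), np.inf)
    valid = points[:, 2] > 1e-6          # keep points in front of the camera
    proj = points[valid] @ K.T           # homogeneous image coordinates
    u = (proj[:, 0] / proj[:, 2]).astype(int)
    v = (proj[:, 1] / proj[:, 2]).astype(int)
    z = points[valid, 2]
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    # z-buffer: keep the smallest depth per pixel
    np.minimum.at(depth, (v[inside], u[inside]), z[inside])
    depth[np.isinf(depth)] = 0.0
    return depth
```

In the actual framework this edited, physically consistent depth is only the primary control signal; it is combined with action, edge, and ray-map conditions to guide the video generation model.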

Why it matters?

This work is important because it significantly reduces the amount of real-world data needed to train a robot. The authors showed that robots trained on data generated by their method can perform as well as, or better than, robots trained on many more real-world examples. This makes robot learning more practical and affordable, and opens the door to robots that adapt to new situations more easily. The system is also flexible, supporting different kinds of edits (such as height and texture changes), suggesting it could serve as a unified data generation framework for a wide range of manipulation tasks.

Abstract

Recent progress in robot learning has been driven by large-scale datasets and powerful visuomotor policy architectures, yet policy robustness remains limited by the substantial cost of collecting diverse demonstrations, particularly for spatial generalization in manipulation tasks. To reduce repetitive data collection, we present Real2Edit2Real, a framework that generates new demonstrations by bridging 3D editability with 2D visual data through a 3D control interface. Our approach first reconstructs scene geometry from multi-view RGB observations with a metric-scale 3D reconstruction model. Based on the reconstructed geometry, we perform depth-reliable 3D editing on point clouds to generate new manipulation trajectories while geometrically correcting the robot poses to recover physically consistent depth, which serves as a reliable condition for synthesizing new demonstrations. Finally, we propose a multi-conditional video generation model guided by depth as the primary control signal, together with action, edge, and ray maps, to synthesize spatially augmented multi-view manipulation videos. Experiments on four real-world manipulation tasks demonstrate that policies trained on data generated from only 1-5 source demonstrations can match or outperform those trained on 50 real-world demonstrations, improving data efficiency by up to 10-50x. Moreover, experimental results on height and texture editing demonstrate the framework's flexibility and extensibility, indicating its potential to serve as a unified data generation framework.