GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning LLMs
Pu Hua, Minghuan Liu, Annabella Macaluso, Yunfeng Lin, Weinan Zhang, Huazhe Xu, Lirui Wang
2024-10-07

Summary
This paper presents GenSim2, a framework that scales up robot learning by automatically generating realistic simulation tasks and demonstration data with multi-modal, reasoning-capable large language models (LLMs).
What's the problem?
Creating diverse and realistic robot simulations is time-consuming and requires substantial human effort. Many existing sim-to-real methods also focus on a single task, which limits a robot's ability to generalize across situations and makes it hard to transfer what is learned in simulation to real-world applications.
What's the solution?
To address these issues, the authors developed GenSim2, which uses coding LLMs with multi-modal and reasoning capabilities to create complex simulation tasks automatically, then generates demonstrations for those tasks with planning and reinforcement-learning solvers. The pipeline can produce data for up to 100 articulated tasks involving 200 objects, significantly reducing the human effort required. GenSim2 also introduces a new policy architecture, the proprioceptive point-cloud transformer (PPT), which learns from the generated demonstrations and performs well in real-world scenarios without extensive retraining; a sketch of the idea follows below.
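To make the PPT idea concrete, here is a minimal sketch of what a proprioceptive point-cloud transformer policy could look like. All names, dimensions, and the single-token point-cloud pooling are illustrative assumptions, not the paper's actual implementation: the core idea shown is that point-cloud, proprioceptive, and language-instruction inputs become tokens fused by a shared transformer trunk that predicts actions.

```python
# Hypothetical PPT-style policy sketch (PyTorch); the paper's real
# architecture may differ in encoders, token counts, and action head.
import torch
import torch.nn as nn

class PPTPolicy(nn.Module):
    def __init__(self, proprio_dim=9, lang_dim=512,
                 d_model=256, action_dim=7, n_layers=4):
        super().__init__()
        # Per-point encoder: a tiny PointNet-style MLP over xyz coordinates.
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, d_model))
        self.proprio_proj = nn.Linear(proprio_dim, d_model)   # robot-state token
        self.lang_proj = nn.Linear(lang_dim, d_model)         # instruction token
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.action_head = nn.Linear(d_model, action_dim)     # e.g. 6-DoF pose + gripper

    def forward(self, points, proprio, lang_emb):
        # points: (B, N, 3); proprio: (B, proprio_dim); lang_emb: (B, lang_dim)
        pt_feats = self.point_mlp(points)                     # (B, N, d_model)
        pt_token = pt_feats.max(dim=1, keepdim=True).values   # global max-pool -> 1 token
        tokens = torch.cat([
            pt_token,
            self.proprio_proj(proprio).unsqueeze(1),
            self.lang_proj(lang_emb).unsqueeze(1),
        ], dim=1)                                             # (B, 3, d_model)
        fused = self.trunk(tokens)                            # cross-modal fusion
        return self.action_head(fused[:, 0])                  # predicted action
```

Because the policy consumes point clouds and proprioception rather than raw RGB, the same observation format is available in simulation and on a real robot, which is one plausible reason such an architecture can transfer zero-shot.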
Why it matters?
This research is important because it makes it easier and faster to train robots for various tasks, improving their ability to operate in real-life situations. By generating realistic training data automatically, GenSim2 could lead to more capable robots that can assist in everyday tasks, manufacturing, and logistics, ultimately enhancing productivity and efficiency.
Abstract
Robotic simulation today remains challenging to scale up due to the human effort required to create diverse simulation tasks and scenes. Simulation-trained policies also face scalability issues, as many sim-to-real methods focus on a single task. To address these challenges, this work proposes GenSim2, a scalable framework that leverages coding LLMs with multi-modal and reasoning capabilities for complex and realistic simulation task creation, including long-horizon tasks with articulated objects. To automatically generate demonstration data for these tasks at scale, we propose planning and RL solvers that generalize within object categories. The pipeline can generate data for up to 100 articulated tasks with 200 objects and reduces the required human effort. To utilize such data, we propose an effective multi-task language-conditioned policy architecture, dubbed proprioceptive point-cloud transformer (PPT), that learns from the generated demonstrations and exhibits strong sim-to-real zero-shot transfer. Combining the proposed pipeline and the policy architecture, we show a promising use of GenSim2: the generated data can be used for zero-shot transfer or co-trained with real-world collected data, which enhances policy performance by 20% compared with training exclusively on limited real data.
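The co-training result in the abstract suggests a simple recipe: mix the large generated simulation dataset with the scarce real demonstrations in every training batch. The sketch below shows one common way to do this in PyTorch; the dataset objects, the 50/50 sampling ratio, and the function name are assumptions for illustration, not the paper's reported setup.

```python
# Hypothetical sim-and-real co-training loader; the paper reports a ~20%
# gain over training on limited real data alone, but its exact mixing
# strategy is not specified here.
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def make_cotrain_loader(sim_dataset, real_dataset,
                        real_fraction=0.5, batch_size=64):
    """Mix abundant generated sim demos with scarce real demos per batch."""
    combined = ConcatDataset([sim_dataset, real_dataset])
    # Weight samples so roughly `real_fraction` of each batch is real data,
    # regardless of how much larger the sim dataset is.
    w_sim = (1.0 - real_fraction) / len(sim_dataset)
    w_real = real_fraction / len(real_dataset)
    weights = torch.tensor(
        [w_sim] * len(sim_dataset) + [w_real] * len(real_dataset))
    sampler = WeightedRandomSampler(
        weights, num_samples=len(combined), replacement=True)
    return DataLoader(combined, batch_size=batch_size, sampler=sampler)
```

Upweighting the small real dataset keeps it from being drowned out by the much larger generated corpus, which is the usual failure mode when the two are simply concatenated.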