Towards Synergistic, Generalized, and Efficient Dual-System for Robotic Manipulation
Qingwen Bu, Hongyang Li, Li Chen, Jisong Cai, Jia Zeng, Heming Cui, Maoqing Yao, Yu Qiao
2024-10-16

Summary
This paper introduces RoboDual, a dual-system framework for robotic manipulation that pairs a generalist policy with a specialist policy so that each covers the other's weaknesses.
What's the problem?
Robots need to be versatile to work in different settings, but existing policies tend to be either generalist (good at many tasks) or specialist (excellent at specific tasks). Generalist policies are slow at inference and costly to train, while specialist policies are efficient and precise but do not adapt well to new situations. Neither approach alone delivers manipulation that is both adaptable and efficient.
What's the solution?
RoboDual is designed as a dual-system framework that takes advantage of both approaches. A generalist model, based on a vision-language-action (VLA) model, handles high-level task understanding, and a lightweight specialist model, a diffusion transformer with about 20M trainable parameters, produces multi-step actions efficiently. The generalist provides guidance, and the specialist quickly generates actions conditioned on that guidance. This combination makes RoboDual both adaptable and efficient: compared to OpenVLA, the authors report a 26.7% improvement in real-world settings and a 12% gain on the CALVIN benchmark.
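The division of labor described above can be illustrated with a toy control loop. This is a minimal sketch, not the authors' implementation: the names `Guidance`, `generalist_step`, `specialist_step`, and `run_dual_system`, the `generalist_every` refresh interval, and the arithmetic inside each function are all hypothetical placeholders. It only shows one plausible arrangement in which a slow generalist refreshes its guidance occasionally while a fast specialist acts at every step.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Guidance:
    """Hypothetical container for what the generalist hands to the specialist."""
    task_latent: List[float]    # stand-in for high-level task understanding
    coarse_action: List[float]  # stand-in for the generalist's discretized action


def generalist_step(observation: List[float], instruction: str) -> Guidance:
    """Placeholder for the slow generalist (a large VLA model in the paper)."""
    latent = [float(len(instruction))]               # toy "task understanding"
    coarse = [float(round(x)) for x in observation]  # toy "discretized" action
    return Guidance(task_latent=latent, coarse_action=coarse)


def specialist_step(observation: List[float], guidance: Guidance) -> List[float]:
    """Placeholder for the fast specialist: blends the coarse action with the
    latest observation. The real specialist is a diffusion transformer."""
    return [0.5 * c + 0.5 * o for c, o in zip(guidance.coarse_action, observation)]


def run_dual_system(
    observations: List[List[float]],
    instruction: str,
    generalist_every: int = 4,  # assumed refresh interval, not from the paper
) -> Tuple[List[List[float]], int]:
    """Run the specialist at every step; refresh guidance every few steps."""
    guidance = None
    actions = []
    generalist_calls = 0
    for t, obs in enumerate(observations):
        if t % generalist_every == 0:
            # Slow path: the generalist is queried only occasionally.
            guidance = generalist_step(obs, instruction)
            generalist_calls += 1
        # Fast path: the specialist acts on the most recent guidance.
        actions.append(specialist_step(obs, guidance))
    return actions, generalist_calls


if __name__ == "__main__":
    obs_stream = [[0.2 * t, 1.0] for t in range(8)]
    actions, calls = run_dual_system(obs_stream, "pick up the block")
    print(f"{len(actions)} actions from {calls} generalist calls")
```

In this sketch the expensive model is called once per `generalist_every` control steps, which is the general idea behind getting a higher control frequency than running the generalist alone at every step.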
Why does it matter?
This research matters because it shows a way to build robot policies that handle a wide range of tasks without giving up speed or precision. By pairing a large generalist model with a small, efficient specialist, RoboDual could lead to robots that are better suited for complex environments, such as those in manufacturing, healthcare, or service industries.
Abstract
The increasing demand for versatile robotic systems that operate in diverse and dynamic environments has emphasized the importance of a generalist policy, which leverages a large cross-embodiment data corpus to facilitate broad adaptability and high-level reasoning. However, the generalist suffers from inefficient inference and costly training. The specialist policy, in contrast, is tailored to domain-specific data and excels at task-level precision with efficiency. Yet, it lacks the generalization capacity for a wide range of applications. Inspired by these observations, we introduce RoboDual, a synergistic dual-system framework that combines the merits of both generalist and specialist policies. A diffusion transformer-based specialist is devised for multi-step action rollouts, conditioned on the high-level task understanding and discretized action output of a vision-language-action (VLA) based generalist. Compared to OpenVLA, RoboDual achieves a 26.7% improvement in real-world settings and a 12% gain on CALVIN by introducing a specialist policy with merely 20M trainable parameters. It maintains strong performance with only 5% of the demonstration data, and enables a 3.8 times higher control frequency in real-world deployment. Code will be made publicly available. Our project page is hosted at: https://opendrivelab.com/RoboDual/