
DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping

Yifan Zhong, Xuchuan Huang, Ruochong Li, Ceyao Zhang, Yitao Liang, Yaodong Yang, Yuanpei Chen

2025-03-03


Summary

This paper introduces DexGraspVLA, a new system that helps robots grasp objects more like humans do. It combines an AI model that understands images and language with a learned control policy, making robots better at picking up all sorts of objects in different situations.

What's the problem?

Robots have a hard time picking up different objects in varied settings. Current methods often work only for specific objects or in limited environments, so robots aren't very flexible when it comes to grasping things in the real world.

What's the solution?

The researchers created DexGraspVLA, which has two main parts: a high-level planner, built on a pre-trained vision-language model, that interprets images and language to decide what to do, and a low-level diffusion-based controller that learns the actual grasping motions. The system converts varied inputs (like images and words) into representations that stay consistent across environments, which helps the robot learn from demonstrations and adapt to new situations more easily.
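The two-level loop described above can be sketched roughly as follows. This is an illustrative outline, not the paper's actual code: the class names (`VisionLanguagePlanner`, `DiffusionController`), the sub-goal format, and the 7-dimensional action vector are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    image: list       # camera frame (placeholder for pixel data)
    instruction: str  # user's language command, e.g. "pick up the cup"

class VisionLanguagePlanner:
    """High-level planner: maps image + language to a grasping sub-goal."""
    def plan(self, obs: Observation) -> dict:
        # In the real system, a pre-trained vision-language model would
        # identify the target object in the scene; here we stub it out
        # by taking the last word of the instruction as the target.
        return {"target": obs.instruction.split()[-1], "mask": None}

class DiffusionController:
    """Low-level controller: produces motor commands toward the sub-goal."""
    def act(self, obs: Observation, subgoal: dict) -> list:
        # A learned diffusion policy would iteratively denoise an action
        # sequence conditioned on the observation and sub-goal; we return
        # a placeholder 7-joint command (dimension is an assumption).
        return [0.0] * 7

def grasp_step(obs: Observation) -> list:
    """One planner-to-controller step of the hierarchical loop."""
    planner = VisionLanguagePlanner()
    controller = DiffusionController()
    subgoal = planner.plan(obs)          # high level decides *what* to grasp
    return controller.act(obs, subgoal)  # low level decides *how* to move
```

The key design choice is the separation of concerns: the planner only needs to reason about images and language, while the controller only needs to produce motions for a given sub-goal, so each part can be trained or swapped independently.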

Why it matters?

This matters because it could make robots much more useful in everyday life. With a success rate above 90% on thousands of combinations of objects, lighting, and backgrounds it has never seen before, DexGraspVLA could help robots work in homes, factories, or anywhere else they need to handle varied objects. This is a big step towards robots that can work alongside humans in many different situations, potentially making many tasks easier and more efficient.

Abstract

Dexterous grasping remains a fundamental yet challenging problem in robotics. A general-purpose robot must be capable of grasping diverse objects in arbitrary scenarios. However, existing research typically relies on specific assumptions, such as single-object settings or limited environments, leading to constrained generalization. Our solution is DexGraspVLA, a hierarchical framework that utilizes a pre-trained Vision-Language model as the high-level task planner and learns a diffusion-based policy as the low-level Action controller. The key insight lies in iteratively transforming diverse language and visual inputs into domain-invariant representations, where imitation learning can be effectively applied due to the alleviation of domain shift. Thus, it enables robust generalization across a wide range of real-world scenarios. Notably, our method achieves a 90+% success rate under thousands of unseen object, lighting, and background combinations in a "zero-shot" environment. Empirical analysis further confirms the consistency of internal model behavior across environmental variations, thereby validating our design and explaining its generalization performance. We hope our work can be a step forward in achieving general dexterous grasping. Our demo and code can be found at https://dexgraspvla.github.io/.
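The abstract's key insight, that domain-invariant representations alleviate domain shift for imitation learning, can be illustrated with a toy example. Here a stand-in "frozen encoder" (just per-image normalization, not the paper's actual model) maps the same scene under two lighting conditions to identical features, so a policy trained on those features in one condition transfers to the other.

```python
import math

def frozen_encoder(pixels):
    # Stand-in for a frozen pre-trained vision encoder: normalizing the
    # input discards a global brightness offset, a crude analogue of a
    # representation that is invariant to lighting changes.
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    std = math.sqrt(var) or 1.0
    return [(p - mean) / std for p in pixels]

# The same scene under dim and bright lighting (a constant offset)...
scene_dim = [0.1, 0.5, 0.9, 0.3]
scene_bright = [p + 0.4 for p in scene_dim]

# ...yields identical features, so an imitation-learned policy sees no
# domain shift between the two conditions.
f1 = frozen_encoder(scene_dim)
f2 = frozen_encoder(scene_bright)
```

Raw pixel values differ between the two scenes, but the encoded features do not; the paper's claim is that learning on such stable representations is what lets the grasping policy generalize across lighting and background changes.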