Helpful DoggyBot: Open-World Object Fetching using Legged Robots and Vision-Language Models
Qi Wu, Zipeng Fu, Xuxin Cheng, Xiaolong Wang, Chelsea Finn
2024-10-02

Summary
This paper presents Helpful DoggyBot, a system that enables legged robots to fetch objects in indoor environments by interpreting user commands with pre-trained vision-language models.
What's the problem?
Legged robots such as quadrupeds have become good at moving around, but they struggle with indoor tasks that require interacting with objects. Several challenges stand in the way: they lack end-effectors for picking things up, they have limited semantic understanding when trained only on simulation data, and tight indoor spaces restrict where they can traverse and reach.
What's the solution?
The authors equip the robot with a front-mounted gripper for picking up objects. A low-level controller, trained in simulation with egocentric depth input, lets the robot perform agile whole-body skills such as climbing and tilting. On top of that, pre-trained vision-language models interpret user commands and locate target objects using two camera views: a third-person fisheye camera and an egocentric RGB camera. The system was tested in two unseen environments without any real-world data collection or fine-tuning, and it followed fetching commands with a 60% success rate.
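The paper itself does not give code for this pipeline, but the division of labor it describes (a VLM that understands the command and locates the object, feeding commands to a learned whole-body controller) can be sketched roughly as follows. Everything here is an assumption for illustration: the names query_vlm, track_object, plan_command, and the Command fields are hypothetical stand-ins, not the authors' actual interfaces.

```python
# Hedged sketch of the fetch pipeline described above. Function bodies are
# placeholders; all names and thresholds are hypothetical, not the paper's API.
from dataclasses import dataclass
import numpy as np


@dataclass
class Command:
    forward_velocity: float  # m/s, sent to the low-level controller
    yaw_rate: float          # rad/s, steers toward the detected object
    climb: bool              # trigger the learned climbing skill


def query_vlm(fisheye_image: np.ndarray, instruction: str) -> str:
    """Ask a pre-trained VLM which object in the third-person fisheye view
    matches the user's instruction (e.g., "fetch the stuffed toy")."""
    raise NotImplementedError  # placeholder for a real VLM call


def track_object(egocentric_rgb: np.ndarray, object_name: str) -> tuple[float, float]:
    """Return the object's bearing (rad) and rough distance (m) from the
    robot's egocentric RGB camera, e.g., via an open-vocabulary detector."""
    raise NotImplementedError  # placeholder for a real detector


def plan_command(bearing: float, distance: float, obstacle_ahead: bool) -> Command:
    """Convert the object's bearing/distance into a whole-body command."""
    return Command(
        forward_velocity=min(0.5, distance),    # slow down near the object
        yaw_rate=float(np.clip(bearing, -1.0, 1.0)),  # turn toward the object
        climb=obstacle_ahead,                   # e.g., a bed between robot and toy
    )
```

The point this sketch tries to capture is the split the paper describes: the VLM layer only has to name and coarsely localize the target, while all agile behavior (climbing, whole-body tilting) is delegated to the simulation-trained low-level controller.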
Why it matters?
This research is important because it enhances the capabilities of legged robots in real-world settings, making them more useful for tasks like fetching items in homes or offices. By improving how these robots understand and interact with their environments, Helpful DoggyBot could lead to more advanced applications in service robotics and automation.
Abstract
Learning-based methods have achieved strong performance for quadrupedal locomotion. However, several challenges prevent quadrupeds from learning helpful indoor skills that require interaction with environments and humans: lack of end-effectors for manipulation, limited semantic understanding using only simulation data, and low traversability and reachability in indoor environments. We present a system for quadrupedal mobile manipulation in indoor environments. It uses a front-mounted gripper for object manipulation, a low-level controller trained in simulation using egocentric depth for agile skills like climbing and whole-body tilting, and pre-trained vision-language models (VLMs) with a third-person fisheye and an egocentric RGB camera for semantic understanding and command generation. We evaluate our system in two unseen environments without any real-world data collection or training. Our system can zero-shot generalize to these environments and complete tasks, like following a user's command to fetch a randomly placed stuffed toy after climbing over a queen-sized bed, with a 60% success rate. Project website: https://helpful-doggybot.github.io/
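As a second illustration, the abstract's low-level controller is trained in simulation and conditioned on egocentric depth. Below is a minimal sketch of what such a depth-conditioned policy's interface might look like, assuming a PyTorch setup; the layer sizes, the 3-dimensional command (velocity, yaw rate, body pitch), and the observation contents are guesses, not the paper's actual architecture.

```python
# Minimal sketch of a depth-conditioned locomotion policy interface.
# Dimensions, layers, and observation contents are assumptions for illustration.
import torch
import torch.nn as nn


class DepthLocomotionPolicy(nn.Module):
    def __init__(self, proprio_dim: int = 48, num_joints: int = 12):
        super().__init__()
        # Encode the egocentric depth image into a compact feature vector.
        self.depth_encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=2), nn.ELU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ELU(),
            nn.Flatten(),
            nn.LazyLinear(64),
        )
        # Fuse depth features with proprioception and the high-level command
        # (forward velocity, yaw rate, body pitch for whole-body tilting).
        self.mlp = nn.Sequential(
            nn.Linear(64 + proprio_dim + 3, 256), nn.ELU(),
            nn.Linear(256, 128), nn.ELU(),
            nn.Linear(128, num_joints),  # target joint positions
        )

    def forward(self, depth: torch.Tensor, proprio: torch.Tensor,
                command: torch.Tensor) -> torch.Tensor:
        z = self.depth_encoder(depth)
        return self.mlp(torch.cat([z, proprio, command], dim=-1))


# Example forward pass with dummy inputs (batch of 1, 64x64 depth image).
policy = DepthLocomotionPolicy()
action = policy(torch.zeros(1, 1, 64, 64), torch.zeros(1, 48), torch.zeros(1, 3))
```

In the system described by the abstract, a network of this kind would be trained in simulation and then run onboard, consuming the commands produced by the VLM layer sketched earlier.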