LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, Michael S. Ryoo
2024-07-01
Summary
This paper introduces LLaRA, a framework that improves how robots learn to perform tasks by framing robot actions as conversation-style instruction-response pairs. It aims to enhance how robots understand and act on combined visual and textual information.
What's the problem?
Robots need to learn to perform varied tasks effectively, but traditional training methods are limited: they often rely on raw state-action data that does not capture the complexity of real-world situations. As a result, robots may struggle to follow instructions or to make decisions based on what they see in their environment.
What's the solution?
To address this, the authors developed LLaRA (Large Language and Robotics Assistant), a framework that reformulates robot action policies as conversation-style instruction-response pairs. They built an automated pipeline that generates diverse, high-quality instruction data from existing robot behavior data, and used it to fine-tune a vision-language model so it can make better policy decisions from visual inputs. The authors tested LLaRA in multiple simulated and real-world environments and found that it outperformed comparable methods.
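To make the idea concrete, here is a minimal sketch of how a single behavior cloning step might be rewritten as a conversation-style instruction-response pair. The function name, field names, and the exact text/coordinate formats are illustrative assumptions, not the paper's actual schema or pipeline.

```python
# Hypothetical sketch: convert one behavior cloning (BC) step into a
# (prompt, response) text pair that a vision-language model can be
# fine-tuned on. All names and formats here are assumptions for
# illustration, not LLaRA's exact data format.

def bc_step_to_conversation(task, action):
    """Turn one BC step into a conversation-style training example.

    task:   natural-language task description
    action: dict with 2D pick/place points normalized to [0, 1]
    """
    # The image placeholder stands in for the visual observation at this step.
    prompt = (
        f"<image>\nThe task is: {task}. "
        "What is the next action the robot should take?"
    )
    px, py = action["pick"]
    qx, qy = action["place"]
    # Encode the continuous action as plain text so the model can generate it.
    response = f"Pick at ({px:.2f}, {py:.2f}), place at ({qx:.2f}, {qy:.2f})."
    return {"prompt": prompt, "response": response}

example = bc_step_to_conversation(
    "put the red block on the blue plate",
    {"pick": (0.42, 0.63), "place": (0.71, 0.28)},
)
print(example["prompt"])
print(example["response"])
```

The key design choice this illustrates is that both the state (image plus task text) and the action (coordinates rendered as text) live in the model's native input/output space, so policy learning reduces to ordinary instruction tuning.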
Why it matters?
This research is important because it shows how robots can learn more effectively by using natural language instructions that mimic human conversation. By improving the way robots are trained, LLaRA can lead to better performance in various applications, such as manufacturing, healthcare, and service industries, where robots need to interact with their environment and follow complex instructions.
Abstract
Large Language Models (LLMs) equipped with extensive world knowledge and strong reasoning skills can tackle diverse tasks across domains, often by posing them as conversation-style instruction-response pairs. In this paper, we propose LLaRA: Large Language and Robotics Assistant, a framework which formulates robot action policy as conversations, and provides improved responses when trained with auxiliary data that complements policy learning. LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process state information as visual-textual prompts and generate optimal policy decisions in text. To train such action policy VLMs, we first introduce an automated pipeline to generate diverse high-quality robotics instruction data from existing behavior cloning data. A VLM finetuned with the resulting collection of datasets based on a conversation-style formulation tailored for robotics tasks can generate meaningful robot action policy decisions. Our experiments across multiple simulated and real-world environments demonstrate the state-of-the-art performance of the proposed LLaRA framework. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA.