Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets
Kaiyuan Chen, Shuangyu Xie, Zehan Ma, Ken Goldberg
2025-05-23
Summary
This paper introduces Robo2VLM, a framework that uses data collected from robots performing tasks in the real world to create richer question-and-answer datasets for training AI that understands both images and language.
What's the problem?
Most visual question answering models are trained on simple or staged data that doesn't reflect the messy, complicated situations robots face in the real world, so these models fall short in real-life applications.
What's the solution?
The researchers built Robo2VLM, which uses information recorded by robots, including their movements and what their sensors capture, to generate much more realistic and detailed question-and-answer datasets. These datasets help train and test AI models so they can better understand 3D space and answer questions about what's happening in real-world environments.
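To make the idea concrete, here is a minimal sketch, not the paper's actual code, of how one step of a robot trajectory (a camera frame plus the robot's own sensed state) might be turned into a single multiple-choice question-and-answer sample; the class and function names are invented for illustration.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TrajectoryStep:
    image_path: str                  # RGB camera frame captured at this step
    gripper_open: bool               # gripper state from the robot's own sensing
    ee_position: Tuple[float, float, float]  # end-effector (x, y, z) in meters


@dataclass
class VQASample:
    image_path: str
    question: str
    choices: List[str]
    answer_index: int


def gripper_state_question(step: TrajectoryStep) -> VQASample:
    """Build a simple state question whose answer is grounded in robot sensing,
    not in human labeling of the image."""
    choices = ["open", "closed"]
    answer = 0 if step.gripper_open else 1
    return VQASample(
        image_path=step.image_path,
        question="Based on the image, is the robot gripper open or closed?",
        choices=choices,
        answer_index=answer,
    )


if __name__ == "__main__":
    step = TrajectoryStep("frame_0042.png", gripper_open=False,
                          ee_position=(0.31, -0.12, 0.25))
    sample = gripper_state_question(step)
    print(sample.question, sample.choices, "->", sample.choices[sample.answer_index])

The key point the sketch illustrates is that the correct answer comes directly from the robot's recorded state, so large numbers of grounded questions can be generated without manual annotation.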
Why it matters?
This is important because it helps create smarter AI that can work alongside robots in real environments, making them more helpful for things like automated factories, home assistance, or search and rescue missions.
Abstract
Robo2VLM is a framework for generating Visual Question Answering datasets from robot trajectory data; it leverages sensory modalities and 3D property understanding to enhance and evaluate Vision-Language Models.