Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets
Kaiyuan Chen, Shuangyu Xie, Zehan Ma, Ken Goldberg
2025-05-23
Summary
This paper introduces Robo2VLM, a framework that uses data collected from robots performing tasks in the real world to create richer question-and-answer datasets for training AI that understands both images and language.
What's the problem?
Most visual question answering models are trained on simple or staged data that doesn't reflect the messy, complicated situations robots face in the real world, so these models fall short in real-life applications.
What's the solution?
The researchers built Robo2VLM, which uses information recorded by robots, including their movements and what their sensors capture, to generate much more realistic and detailed question-and-answer datasets. These datasets help train and test AI models so they can better understand 3D space and answer questions about what's happening in real-world environments.
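To make the idea concrete, here is a minimal sketch, not the paper's actual code, of how one step of a robot trajectory (a camera frame plus the robot's own sensed state) might be turned into a single multiple-choice question-and-answer sample; the class and function names are invented for illustration.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TrajectoryStep:
    image_path: str                  # RGB camera frame captured at this step
    gripper_open: bool               # gripper state from the robot's own sensing
    ee_position: Tuple[float, float, float]  # end-effector (x, y, z) in meters


@dataclass
class VQASample:
    image_path: str
    question: str
    choices: List[str]
    answer_index: int


def gripper_state_question(step: TrajectoryStep) -> VQASample:
    """Build a simple state question whose answer is grounded in robot sensing,
    not in human labeling of the image."""
    choices = ["open", "closed"]
    answer = 0 if step.gripper_open else 1
    return VQASample(
        image_path=step.image_path,
        question="Based on the image, is the robot gripper open or closed?",
        choices=choices,
        answer_index=answer,
    )


if __name__ == "__main__":
    step = TrajectoryStep("frame_0042.png", gripper_open=False,
                          ee_position=(0.31, -0.12, 0.25))
    sample = gripper_state_question(step)
    print(sample.question, sample.choices, "->", sample.choices[sample.answer_index])

The key point the sketch illustrates is that the correct answer comes directly from the robot's recorded state, so large numbers of grounded questions can be generated without manual annotation.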
Why it matters?
This is important because it helps create smarter AI that can work alongside robots in real environments, making them more helpful for things like automated factories, home assistance, or search and rescue missions.
Abstract
Robo2VLM is a framework for generating Visual Question Answering datasets from robot trajectory data; it leverages sensory modalities and 3D property understanding to enhance and evaluate Vision-Language Models.