
G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning

Liang Chen, Hongcheng Gao, Tianyu Liu, Zhiqi Huang, Flood Sung, Xinyu Zhou, Yuxin Wu, Baobao Chang

2025-05-27


Summary

This paper introduces a new way to train vision-language models, AI systems that can understand both images and text. The method uses a training environment called VLM-Gym together with reinforcement learning to help these models get better at both seeing and thinking, especially when they have to interact with something, such as a video game.

What's the problem?

Even though vision-language models can recognize images and understand text, they often struggle to act on that knowledge, for example by making decisions or solving problems in interactive situations. This gap between what they know and what they can do limits their usefulness.

What's the solution?

The authors train the models in simulated game environments where success requires both perception and reasoning. Reinforcement learning rewards the model for making good choices, so it learns to connect what it sees and reads with the actions it takes. Models trained this way perform better on interactive tasks than previous models. A toy sketch of this kind of reward-driven interaction loop is shown below.
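To make the idea concrete, here is a minimal, self-contained sketch of a reward-driven interaction loop. Everything in it is hypothetical: ToyGameEnv and ToyPolicy are stand-ins invented for illustration, not VLM-Gym's actual API, and the simple bandit-style value update stands in for the policy-gradient RL the paper actually uses on a vision-language model.

```python
import random

# Hypothetical stand-ins for the paper's components: a Gym-style game
# environment that emits observations, and a policy that picks actions.
# Neither name comes from the paper; this only illustrates the loop of
# act -> receive reward -> update that reinforcement learning relies on.

class ToyGameEnv:
    """Stub one-step game: a fixed hidden target action yields reward 1."""
    def __init__(self, num_actions=4):
        self.num_actions = num_actions
        self.target = random.randrange(num_actions)

    def reset(self):
        return {"image": "current_game_frame"}  # placeholder for a real screenshot

    def step(self, action):
        reward = 1.0 if action == self.target else 0.0  # reward good choices
        return self.reset(), reward, True  # one-step episodes keep the sketch short

class ToyPolicy:
    """Stub policy: per-action preferences updated from observed rewards."""
    def __init__(self, num_actions=4, lr=0.1, epsilon=0.1):
        self.values = [0.0] * num_actions
        self.lr = lr
        self.epsilon = epsilon

    def act(self, observation):
        # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
        if random.random() < self.epsilon:
            return random.randrange(len(self.values))
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def update(self, action, reward):
        # Move the chosen action's value toward the reward it produced.
        self.values[action] += self.lr * (reward - self.values[action])

env, policy = ToyGameEnv(), ToyPolicy()
for episode in range(1000):
    obs = env.reset()
    action = policy.act(obs)
    obs, reward, done = env.step(action)
    policy.update(action, reward)

print(policy.values)  # the target action's value climbs toward 1.0
```

In the real system, the "observation" is an actual game frame, the policy is the vision-language model itself, and the update is a policy-gradient step, but the core loop of acting, being rewarded, and updating is the same.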

Why does it matter?

This matters because it means AI systems can become more capable and reliable in real-world situations where they must understand both images and language and then act on that understanding. That could lead to smarter robots, better game-playing AIs, and more helpful digital assistants.

Abstract

VLM-Gym addresses the "knowing-doing" gap in Vision-Language Models by training them in a diverse RL environment, leading to enhanced perception and reasoning abilities that surpass existing models in interactive games.