ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs
Xiyao Wang, Zhengyuan Yang, Chao Feng, Yongyuan Liang, Yuhang Zhou, Xiaoyu Liu, Ziyi Zang, Ming Li, Chung-Ching Lin, Kevin Lin, Linjie Li, Furong Huang, Lijuan Wang
2025-06-16
Summary
This paper introduces ViCrit, a new way to improve how vision-language models (VLMs) understand images: the models are trained to spot small, deliberately injected mistakes, called hallucinations, in detailed image captions. Using reinforcement learning, the model learns to find exactly which words in the caption contradict the image, which pushes it to pay closer attention to fine visual details.
What's the problem?
The problem is that it is hard to train VLMs with reinforcement learning because existing visual tasks are either too complicated to check automatically or involve long descriptions that are difficult to score clearly. This makes it hard to give the models clear, automatic feedback on whether they understood an image correctly.
What's the solution?
The solution was to create a task that injects a single, subtle, plausible mistake into a long, human-written image caption and asks the model to identify the exact part that is wrong by looking at the image. Because the injected error is known in advance, the model's answer can be checked automatically, yielding a simple yes-or-no reward for reinforcement learning that makes training easier and more precise. By learning to detect these subtle errors, the model improves its visual perception across many different types of images and tasks.
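The error-injection and yes-or-no reward described above can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation: the helper names (`inject_error`, `vicrit_reward`) and the substring-matching rule are assumptions, and the paper's exact scoring may differ.

```python
def inject_error(caption: str, original: str, replacement: str) -> str:
    """Corrupt a caption by swapping one correct phrase for a subtle wrong one.

    (Illustrative only; how ViCrit actually selects and edits phrases
    is described in the paper, not reproduced here.)
    """
    return caption.replace(original, replacement, 1)


def vicrit_reward(model_answer: str, injected_phrase: str) -> int:
    """Binary, automatically checkable reward: 1 if the model's answer
    names the injected (wrong) phrase, 0 otherwise."""
    return int(injected_phrase.lower() in model_answer.lower())


caption = "A man in a red jacket walks two dogs along the beach."
corrupted = inject_error(caption, "red jacket", "blue jacket")

print(corrupted)
print(vicrit_reward("The wrong span is 'blue jacket'.", "blue jacket"))  # 1
print(vicrit_reward("The caption looks correct to me.", "blue jacket"))  # 0
```

Because the reward is a deterministic check against a known edit, it slots directly into standard RL fine-tuning loops without a learned or human judge.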
Why it matters?
This matters because better visual perception helps AI models understand images more accurately, which is important for many applications such as reading charts, recognizing objects, or interpreting visuals more like humans do. The ViCrit method also helps models learn general visual skills that transfer to kinds of images beyond those seen during training.
Abstract
ViCrit, an RL task for fine-tuning VLMs, improves visual perception by training models to detect subtle hallucinations in image captions, with gains transferable to various visual domains.