OViP: Online Vision-Language Preference Learning
Shujun Liu, Siyuan Wang, Zejun Li, Jianxiang Wang, Cheng Zeng, Zhongyu Wei
2025-05-23
Summary
This paper introduces a new method called OViP that helps models which understand both images and text make fewer mistakes by learning from better, automatically generated training examples.
What's the problem?
The problem is that large vision-language models sometimes 'hallucinate' when describing or answering questions about images, mentioning objects or details that are not actually there, which makes them unreliable.
What's the solution?
The researchers created OViP, a system that uses a diffusion model to generate challenging contrastive training examples on the fly, based on the model's own errors. By comparing good and bad answers, the main model learns to tell faithful descriptions apart from hallucinated ones, making it less likely to hallucinate while still handling both images and text well.
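The core idea of learning from good-versus-bad answer pairs can be illustrated with a minimal sketch. This is not the paper's implementation: the function names are hypothetical, the diffusion model is replaced by a stub string, and the loss shown is a generic DPO-style preference loss rather than OViP's exact objective.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Generic DPO-style preference loss on one (chosen, rejected) answer pair.

    logp_* are log-probabilities under the model being trained; ref_* are
    log-probabilities under a frozen reference model. The loss shrinks as the
    model prefers the chosen answer more than the reference does.
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

def build_preference_pair(image, question, good_answer, bad_answer):
    """Hypothetical online step: the model's own hallucinated answer becomes the
    rejected response, and a diffusion model (stubbed here as a string) would
    synthesize a contrastive image matching that wrong description."""
    contrastive_image = f"diffusion_render({bad_answer!r})"  # stub, not a real call
    return {
        "prompt": (image, question),
        "chosen": good_answer,
        "rejected": bad_answer,
        "contrastive_image": contrastive_image,
    }

# A model that assigns higher probability to the faithful answer than the
# reference does incurs a lower loss than one with no preference at all.
loss_good = dpo_loss(logp_chosen=-1.0, logp_rejected=-3.0,
                     ref_chosen=-1.5, ref_rejected=-1.5)
loss_flat = dpo_loss(-2.0, -2.0, -2.0, -2.0)  # no preference: -log(sigmoid(0)) = log 2
```

Generating the rejected image and answer online, from the model's current failures, is what keeps the training pairs challenging as the model improves.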
Why it matters?
This is important because it makes these models more trustworthy and accurate, which is really useful for things like search engines, educational tools, and any application where understanding both images and language is needed.
Abstract
OViP dynamically generates contrastive training data using a diffusion model to reduce hallucinations in large vision-language models while maintaining their multi-modal capabilities.