Online Self-Calibration Against Hallucination in Vision-Language Models

Minghui Chen, Chenxu Yang, Hengjie Zhu, Dayan Wu, Zheng Lin, Qingyi Si

2026-05-04

Summary

This paper tackles a common failure of large vision-language models, AI systems that understand both images and text: they often 'hallucinate', describing details that aren't actually present in the image they're looking at.

What's the problem?

These models are often trained to imitate the answers of even *better* AI models, but this creates a problem: the student model is pushed to reproduce fine-grained details it can't actually 'see' in the image, so it learns to guess instead of truly understanding what's there. It's like memorizing answers without understanding the concepts.

What's the solution?

The researchers noticed that these models are actually pretty good at *checking* whether a statement about an image is true or false, but much weaker at *creating* descriptions from scratch. They exploit this gap in a system called OSCAR. OSCAR borrows a planning technique from game-playing AI (Monte Carlo Tree Search) to explore different possible descriptions, rewarding the model whenever it makes accurate, grounded statements about the image. The model then uses this self-generated feedback to improve itself over time.
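To make the self-checking idea concrete, here is a minimal, hypothetical sketch of the core loop: the model's *discriminative* side scores claims in each sampled description, and the best- and worst-scoring candidates become a preference pair for later training. All names (`verify_claim`, `grounding_reward`, the toy set-membership verifier) are illustrative stand-ins, not the paper's actual components.

```python
# Hypothetical sketch: score candidate descriptions with a self-verifier,
# then pick a (chosen, rejected) preference pair. The real OSCAR verifier
# is the LVLM itself answering true/false questions; here a toy stand-in
# treats a claim as "grounded" if all its words are known image facts.

def verify_claim(image_facts: set, claim: str) -> bool:
    # Stand-in discriminative check (the task LVLMs are better at).
    return all(word in image_facts for word in claim.split())

def grounding_reward(image_facts: set, description: list) -> float:
    # Fraction of claims in the description that the verifier accepts.
    if not description:
        return 0.0
    return sum(verify_claim(image_facts, c) for c in description) / len(description)

def build_preference_pair(image_facts: set, candidates: list):
    # Rank sampled candidates by self-verified reward; the best becomes
    # the "chosen" response and the worst the "rejected" one.
    scored = sorted(candidates, key=lambda d: grounding_reward(image_facts, d))
    return scored[-1], scored[0]

facts = {"dog", "ball", "grass"}
candidates = [["dog", "ball"], ["dog", "cat"], ["dog", "ball", "grass"]]
chosen, rejected = build_preference_pair(facts, candidates)
```

In the actual method, the candidates come from a Monte Carlo Tree Search over partial descriptions rather than flat sampling, and rewards are assigned at two granularities, but the pairing logic is the same in spirit.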

Why it matters?

This work matters because it offers a way to make vision-language models more reliable and truthful. By letting a model learn from its own ability to verify information, rather than from potentially flawed examples produced by other models, it can reduce hallucinations and get better at genuinely grounding language in images, which ultimately makes these systems more useful in real-world applications.

Abstract

Large Vision-Language Models (LVLMs) often suffer from hallucinations, generating descriptions that include visual details absent from the input image. Recent preference alignment methods typically rely on supervision distilled from stronger models such as GPT. However, this offline paradigm introduces a Supervision-Perception Mismatch: the student model is forced to align with fine-grained details beyond its perceptual capacity, learning to guess rather than to see. To obtain reliable self-supervision for online learning, we identify a Generative-Discriminative Gap within LVLMs, where models exhibit higher accuracy on discriminative verification than open-ended generation. Leveraging this capability, we propose Online Self-CAlibRation (OSCAR), a framework that integrates Monte Carlo Tree Search with a Dual-Granularity Reward Mechanism to construct preference data and iteratively refines the model via Direct Preference Optimization. Extensive experiments demonstrate that OSCAR achieves state-of-the-art performance on hallucination benchmarks while improving general multimodal capabilities.
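For reference, the Direct Preference Optimization step mentioned in the abstract typically uses the standard DPO objective (Rafailov et al.); the paper's contribution lies in how the preference pairs $(y_w, y_l)$ are constructed via self-verification, not in a new loss. The standard form is:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

where $y_w$ and $y_l$ are the chosen and rejected responses, $\pi_{\mathrm{ref}}$ is the frozen reference model, and $\beta$ controls how far the policy may drift from it. Whether OSCAR modifies this objective is not stated in the abstract.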