Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension
Wang Xiyao, Yang Zhengyuan, Li Linjie, Lu Hongjin, Xu Yuancheng, Lin Chung-Ching, Lin Kevin, Huang Furong, Wang Lijuan
2024-12-06
Summary
This paper introduces the Vision Value Model (VisVM), a new approach designed to improve how vision-language models (VLMs) understand and generate responses based on images and text.
What's the problem?
Although vision-language models have made great strides in understanding images and text together, they often generate responses that lack detail or accuracy. This can lead to 'hallucinations,' where the model creates information that isn't actually present in the visual content. Improving the quality of these responses is challenging, especially during the inference stage when the model generates answers.
What's the solution?
The authors introduce VisVM, which guides the generation process by evaluating not just the current sentence but also predicting how well the sentences that follow will align with the visual content. This long-term perspective allows VisVM to steer VLMs away from generating vague or incorrect sentences. At each search step, candidate sentences are scored by this value model and the highest-valued one is kept, so each generated sentence stays coherent and relevant to the image, resulting in more detailed and accurate descriptions.
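The search procedure described above can be sketched as a simple sentence-level loop: sample several candidate next sentences, score each with a value model, and keep the best. The sketch below uses toy stand-ins for the VLM generator and the value model (the function names, templates, and scoring heuristic are assumptions for illustration, not the paper's actual models); only the guided-search loop itself reflects the described method.

```python
# Hedged sketch of VisVM-style value-guided, sentence-level search.
# generate_candidates and value are toy stand-ins (assumptions), not
# the paper's actual VLM or trained value model.
import random


def generate_candidates(prefix, n=4):
    """Stand-in for a VLM sampling n candidate next sentences."""
    templates = [
        "A dog runs across the grass.",
        "Something blurry is there.",
        "The sky above is bright blue.",
        "Maybe a cat sits nearby.",
    ]
    return random.sample(templates, n)


def value(image, prefix, sentence):
    """Stand-in value model. VisVM would score both the sentence's
    current visual alignment and the expected quality of subsequent
    sentences; here we simply penalize vague, hallucination-prone
    wording to illustrate the scoring role."""
    hedges = ("blurry", "maybe", "something")
    score = 1.0
    for word in hedges:
        if word in sentence.lower():
            score -= 0.5
    return score


def visvm_guided_search(image, num_sentences=3, n_candidates=4):
    """Greedy sentence-level search: at each step, sample candidates
    and keep the one the value model scores highest."""
    caption = []
    for _ in range(num_sentences):
        prefix = " ".join(caption)
        candidates = generate_candidates(prefix, n_candidates)
        best = max(candidates, key=lambda s: value(image, prefix, s))
        caption.append(best)
    return " ".join(caption)
```

Because the value model looks at whole sentences rather than individual tokens, the search operates at a coarser granularity than token-level decoding, which is what lets it express a long-term preference for detailed, grounded continuations.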
Why it matters?
This research is important because it enhances the ability of AI systems to provide better visual understanding and descriptions, which can be useful in various applications like automated captioning, image search, and accessibility tools for visually impaired users. By improving how VLMs generate responses, VisVM could lead to more reliable and informative interactions between images and text.
Abstract
Despite significant advancements in vision-language models (VLMs), effective approaches for enhancing response quality by scaling inference-time computation are still lacking. This capability is known to be a core step toward self-improving models in recent large language model studies. In this paper, we present the Vision Value Model (VisVM), which can guide VLM inference-time search to generate responses with better visual comprehension. Specifically, VisVM not only evaluates the generated sentence quality in the current search step, but also anticipates the quality of subsequent sentences that may result from the current step, thus providing a long-term value. In this way, VisVM steers VLMs away from generating sentences prone to hallucinations or insufficient detail, thereby producing higher quality responses. Experimental results demonstrate that VisVM-guided search significantly enhances VLMs' ability to generate descriptive captions with richer visual details and fewer hallucinations, compared with greedy decoding and search methods with other visual reward signals. Furthermore, we find that self-training the model with the VisVM-guided captions improves VLM performance across a wide range of multimodal benchmarks, indicating the potential for developing self-improving VLMs. Our value model and code are available at https://github.com/si0wang/VisVM.