Listener-Rewarded Thinking in VLMs for Image Preferences

Alexander Gambashidze, Li Pengyi, Matvey Skripkin, Andrey Galichin, Anton Gusarov, Konstantin Sobolev, Andrey Kuznetsov, Ivan Oseledets

2025-07-01

Summary

This paper presents a new way to train AI models to better understand and predict human preferences for images and videos. The method adds a 'listener' model that checks whether the AI's explanation for preferring one image over another actually makes sense.

What's the problem?

Existing AI models often give final answers that conflict with their own step-by-step explanations, which makes them less accurate and less reliable at predicting what people will like, especially on images unlike those they were trained on.

What's the solution?

The researchers created a listener-augmented training framework: a second, frozen vision-language model acts as a 'listener' that reads the main model's reasoning and reports how convinced it is by that explanation. This confidence is turned into a soft reward that is combined with the usual correctness reward during reinforcement learning (GRPO), so the AI is encouraged not only to choose the right images but also to explain its choices convincingly (a simplified sketch of such a reward follows below).
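To make the reward idea concrete, here is a minimal Python sketch. The names (listener_reward, listener_prob_correct, listener_weight) are hypothetical, and the paper's exact reward formulation and weighting may differ; this only illustrates the general shape of blending a hard correctness signal with a soft listener signal before it is fed to GRPO.

```python
# Hypothetical sketch of a listener-shaped reward for GRPO.
# Names are illustrative, not the authors' actual API.

def listener_reward(
    correct_choice: str,          # ground-truth preferred image, e.g. "A"
    reasoner_choice: str,         # the policy model's final answer
    listener_prob_correct: float, # frozen listener's confidence in the
                                  # correct choice, given only the reasoning
    listener_weight: float = 0.5, # assumed blend factor between the two terms
) -> float:
    """Combine a hard correctness reward with a soft listener reward.

    The correctness term rewards picking the human-preferred image;
    the listener term rewards reasoning that independently convinces
    a frozen VLM of that same choice.
    """
    correctness = 1.0 if reasoner_choice == correct_choice else 0.0
    return (1.0 - listener_weight) * correctness + listener_weight * listener_prob_correct


# Example: the policy picked the right image, but its explanation
# only weakly convinces the listener (confidence 0.6).
reward = listener_reward("A", "A", listener_prob_correct=0.6)
print(reward)  # 0.8 with the assumed 50/50 weighting
```

The key design point is that the listener never sees the final answer being graded on its own: it judges the reasoning, so a model that guesses correctly with an unconvincing explanation still receives a reduced reward.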

Why it matters?

This matters because it helps AI systems better model human tastes, improving the alignment between AI-generated content and what people actually want. That, in turn, is important for producing better art, videos, and other media.

Abstract

A listener-augmented GRPO framework improves the accuracy and generalization of reward models for aligning text-to-image and text-to-video generative models with human preferences.