From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images

Yiming Chen, Junlin Han, Tianyi Bai, Shengbang Tong, Filippos Kokkinos, Philip Torr

2025-12-01

Summary

This paper tackles a gap in today's AI: models are getting good at *seeing* what's in an image, but they don't understand how a person *feels* when looking at it – things like whether it's funny, beautiful, or memorable.

What's the problem?

Current AI models, specifically Multimodal Large Language Models (MLLMs), can identify objects and describe scenes in images, but they struggle with subjective qualities. They can't accurately predict what humans would find emotionally impactful, aesthetically pleasing, or even just memorable about an image, and until now there hasn't been a good way to measure how well these models understand these 'human' aspects of images.

What's the solution?

The researchers created a new benchmark called CogIP-Bench to specifically test how well AI models understand these subjective image qualities. They found the models weren't very good at it, so they added a post-training phase – extra training on top of the initial training – focused on aligning the AI's responses with human opinions. This significantly improved the AI's ability to predict what humans would think and feel about an image, and the improvement even carried over to image generation, helping the AI produce images with desired qualities like being 'memorable' or 'visually appealing'.
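
To make "aligning with human opinions" concrete, here is a minimal sketch of how a model's ratings could be compared against human judgments on a CogIP-Bench-style dataset. The `mllm.generate` call, the prompt wording, and the dataset fields are illustrative assumptions, not the paper's actual evaluation code.

```python
# Minimal sketch of measuring human alignment on a CogIP-Bench-style dataset.
# Hypothetical pieces: the mllm.generate API, the prompt wording, and the
# dataset record fields ("image", "human_rating").
from scipy.stats import spearmanr

def score_image(mllm, image, prop):
    """Ask the MLLM to rate one subjective property of an image on a 1-10 scale."""
    prompt = f"On a scale of 1 to 10, how {prop} is this image? Reply with a single number."
    reply = mllm.generate(image=image, prompt=prompt)  # hypothetical MLLM API
    return float(reply.strip())

def human_alignment(mllm, dataset, prop="memorable"):
    """Spearman rank correlation between model scores and human ratings."""
    model_scores = [score_image(mllm, ex["image"], prop) for ex in dataset]
    human_scores = [ex["human_rating"] for ex in dataset]
    rho, _ = spearmanr(model_scores, human_scores)
    return rho  # closer to 1.0 means better agreement with human judgments
```

A correlation like this is a common way to quantify alignment with subjective ratings, since it rewards getting the *ranking* of images right rather than exact numbers.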

Why it matters?

This work is important because it moves AI closer to understanding images the way humans do. It's not just about recognizing *what* is in a picture, but understanding *how* it makes us feel. This is crucial for creating AI that can be used for more creative tasks, like generating art or designing user interfaces that are genuinely engaging and emotionally resonant.

Abstract

While Multimodal Large Language Models (MLLMs) are adept at answering what is in an image (identifying objects and describing scenes), they often lack the ability to understand how an image feels to a human observer. This gap is most evident when considering subjective cognitive properties, such as what makes an image memorable, funny, aesthetically pleasing, or emotionally evocative. To systematically address this challenge, we introduce CogIP-Bench, a comprehensive benchmark for evaluating MLLMs on such image cognitive properties. Our evaluation reveals a significant gap: current models are poorly aligned with human perception of these nuanced properties. We then demonstrate that a post-training phase can effectively bridge this gap, significantly enhancing the model's alignment with human judgments. Furthermore, we show that this learned cognitive alignment is not merely predictive but also transferable to downstream creative tasks. By integrating our cognitively-aligned MLLM into an image generation pipeline, we can guide the synthesis process to produce images that better embody desired traits, such as being more memorable or visually appealing. Our work provides a benchmark to measure this human-like perception, a post-training pipeline to enhance it, and a demonstration that this alignment unlocks more human-centric AI.
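
The abstract mentions guiding the synthesis process with the cognitively-aligned MLLM. One simple form such guidance could take (the paper's actual pipeline may be more involved) is best-of-N selection, sketched below using the hypothetical `score_image` helper from the earlier example and an assumed `generator.sample` API.

```python
# Hypothetical best-of-N guidance: sample several candidate images and keep
# the one the cognitively-aligned MLLM rates highest for the desired trait.
# Reuses score_image from the sketch above; generator.sample is an assumed API.
def generate_with_trait(generator, aligned_mllm, prompt, trait="memorable", n=8):
    candidates = [generator.sample(prompt) for _ in range(n)]
    scores = [score_image(aligned_mllm, img, trait) for img in candidates]
    best = max(range(n), key=lambda i: scores[i])  # index of highest-rated image
    return candidates[best]
```

The key point this illustrates is that a model which *predicts* human perception well can double as a selection signal, steering generation toward images people will actually find memorable or appealing.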