UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture
Shuo Cao, Jiayang Li, Xiaohui Li, Yuandong Pu, Kaiwen Zhu, Yuanting Gao, Siqi Luo, Yi Xin, Qi Qin, Yu Zhou, Xiangyu Chen, Wenlong Zhang, Bin Fu, Yu Qiao, Yihao Liu
2025-12-29
Summary
This paper examines how well AI models that can 'see' and 'understand' images, called Multimodal Large Language Models, actually grasp the subtle perceptual details of an image: how aesthetically pleasing it is, its technical quality, and its structure and textures.
What's the problem?
Current AI models are pretty good at identifying *what* is in an image, like recognizing a cat or a car, but they struggle with the more nuanced, human-level aspects of an image. They can't easily judge whether an image is beautiful, whether it is blurry, or whether its patterns and textures are interesting. Until now, there has been no good way to systematically test and improve these 'perceptual' abilities in these models.
What's the solution?
The researchers created a new benchmark called UniPercept-Bench, which includes a hierarchical system for defining and measuring perceptual image understanding across aesthetics, quality, and structure and texture, together with large-scale datasets for evaluation. They then developed a new AI model, UniPercept, and trained it in two stages: a domain-adaptive pre-training step that adapts it to these visual domains, followed by a reinforcement learning step that uses task-aligned rewards to sharpen its perceptual judgments, as sketched below.
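To make the reward-learning idea concrete, the sketch below shows one way a task-aligned reward could be computed for a visual-rating sample: the model's free-form answer is parsed into a numeric score and compared against a human reference rating. The parsing rules, reward shape, and function names here are illustrative assumptions, not the paper's actual implementation.

```python
from __future__ import annotations
import re

def parse_rating(model_answer: str) -> float | None:
    """Extract the first numeric rating from the model's free-form answer.
    Assumes ratings are expressed on a 1-5 scale somewhere in the text."""
    match = re.search(r"\d+(?:\.\d+)?", model_answer)
    if match is None:
        return None
    value = float(match.group())
    return value if 1.0 <= value <= 5.0 else None

def rating_reward(model_answer: str, human_score: float) -> float:
    """Hypothetical task-aligned reward for a Visual Rating (VR) sample:
    1.0 for a perfect match with the human mean opinion score, decaying
    linearly with the absolute error; unparsable answers get no reward."""
    predicted = parse_rating(model_answer)
    if predicted is None:
        return 0.0  # format penalty: the answer contained no valid rating
    error = abs(predicted - human_score)
    return max(0.0, 1.0 - error / 4.0)  # 4.0 = width of the 1-5 scale

# Example: a candidate answer scored against a human rating of 4.2
print(rating_reward("The image quality is good, I would rate it 4.", 4.2))
```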
Why does it matter?
This work is important because it provides a standard way to evaluate and improve how well AI understands images on a deeper, more human-like level. This isn't just about making prettier pictures; it's crucial for applications like improving image generation AI, helping AI give better feedback on photos, and ultimately building AI that can truly 'see' the world as we do.
Abstract
Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks such as visual grounding, segmentation, and captioning. However, their grasp of perceptual-level image attributes remains limited. In this work, we present UniPercept-Bench, a unified framework for perceptual-level image understanding across three key domains: Aesthetics, Quality, and Structure & Texture. We establish a hierarchical definition system and construct large-scale datasets to evaluate perceptual-level image understanding. Based on this foundation, we develop a strong baseline, UniPercept, trained via Domain-Adaptive Pre-Training and Task-Aligned Reinforcement Learning (RL), which generalizes robustly across both Visual Rating (VR) and Visual Question Answering (VQA) tasks. UniPercept outperforms existing MLLMs on perceptual-level image understanding and can serve as a plug-and-play reward model for text-to-image generation. This work defines Perceptual-Level Image Understanding in the era of MLLMs and, by introducing a comprehensive benchmark together with a strong baseline, provides a solid foundation for advancing perceptual-level multimodal image understanding.
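The abstract notes that UniPercept can serve as a plug-and-play reward model for text-to-image generation. A minimal best-of-N re-ranking sketch is shown below; the `perceptual_score` callable stands in for a call to such a reward model and is a hypothetical placeholder, as are the candidate image paths and the dummy scorer used in the usage example.

```python
from typing import Callable, Sequence

def best_of_n(prompt: str,
              candidates: Sequence[str],
              perceptual_score: Callable[[str, str], float]) -> str:
    """Return the candidate image (represented here by its file path) that a
    perceptual reward model rates highest for the given prompt.
    `perceptual_score(prompt, image) -> float` is a hypothetical stand-in for
    a reward model such as UniPercept; its interface is an assumption."""
    scored = [(perceptual_score(prompt, image), image) for image in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0][1]

# Toy usage: a dummy scorer backed by a fixed lookup table of scores.
if __name__ == "__main__":
    fake_scores = {"gen_0.png": 0.62, "gen_1.png": 0.81, "gen_2.png": 0.45}
    dummy_scorer = lambda prompt, image: fake_scores[image]
    images = ["gen_0.png", "gen_1.png", "gen_2.png"]
    print(best_of_n("a foggy harbor at dawn", images, dummy_scorer))
```

The same pattern extends naturally to other uses of a perceptual reward signal, such as filtering training data or guiding fine-tuning of the generator.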