Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
Ruixiang Jiang, Changwen Chen
2025-01-16
Summary
This paper explores whether AI models that understand both text and images (Multimodal Large Language Models, or MLLMs) can judge the aesthetic quality of artwork without any special training. It's like giving a computer the ability to act as an art critic right out of the box.
What's the problem?
Current AI models are great at many tasks, but they struggle to judge art the way humans do. They often hallucinate details or give verdicts that don't match what people actually think about the art. This is because judging art is highly subjective and requires understanding cultural and emotional context that is hard for computers to grasp.
What's the solution?
The researchers created a new dataset called MM-StyleBench, a large, richly annotated collection of stylized artworks used to benchmark the models (no extra training is involved, since the setting is zero-shot). They also developed a prompting method called ArtCoT that helps the AI reason about art in a more structured way, breaking the evaluation into smaller sub-tasks and using concrete rather than vague language. This approach makes the AI's opinions about art align more closely with what humans would say.
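The decomposition idea can be pictured as a small prompt pipeline: instead of one open-ended "is this beautiful?" question, the model is asked a sequence of narrower sub-questions phrased in concrete language. The sub-task names and wording below are illustrative assumptions, not the paper's actual prompts:

```python
# Illustrative ArtCoT-style decomposition (hypothetical wording):
# three narrow sub-tasks in concrete language replace one vague question.
SUBTASKS = [
    ("content_analysis",
     "Describe the subject and composition of the image in concrete terms."),
    ("style_analysis",
     "Identify the color palette, brushwork, and stylistic elements."),
    ("verdict",
     "Based on the analyses above, rate the aesthetic quality from 1 to 10."),
]

def build_prompts(artwork_description):
    """Return one prompt per sub-task; each shares the artwork context.
    In a real pipeline, an MLLM would answer each prompt in turn and
    earlier answers would be fed into later prompts."""
    context = f"Artwork: {artwork_description}"
    return [
        {"subtask": name, "prompt": f"{context}\n{instruction}"}
        for name, instruction in SUBTASKS
    ]

for step in build_prompts("a stylized portrait in the manner of Van Gogh"):
    print(step["subtask"])
```

The point of the structure is that each sub-prompt constrains the model to checkable, concrete statements before any subjective verdict is asked for.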
Why does it matter?
This research matters because it could lead to AI that understands and creates art in ways that feel more natural to humans. It could help in developing better tools for artists, improving how we search for and categorize artwork online, and even in creating new forms of digital art. By making AI better at understanding the subjective aspects of art, we're moving closer to AI systems that can engage with human culture in more meaningful ways.
Abstract
We present the first study on how Multimodal LLMs' (MLLMs) reasoning ability can be elicited to evaluate the aesthetics of artworks. To facilitate this investigation, we construct MM-StyleBench, a novel high-quality dataset for benchmarking artistic stylization. We then develop a principled method for human preference modeling and perform a systematic correlation analysis between MLLMs' responses and human preference. Our experiments reveal an inherent hallucination issue of MLLMs in art evaluation, associated with response subjectivity. ArtCoT is proposed, demonstrating that art-specific task decomposition and the use of concrete language boost MLLMs' reasoning ability for aesthetics. Our findings offer valuable insights into MLLMs for art and can benefit a wide range of downstream applications, such as style transfer and artistic image generation. Code available at https://github.com/songrise/MLLM4Art.
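The correlation analysis the abstract mentions compares model scores against human preference. A common way to measure such alignment is Spearman rank correlation; below is a minimal pure-Python sketch with made-up scores (not the paper's data or its exact preference-modeling method):

```python
def rankdata(xs):
    """Average ranks (1-based), handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend over a run of tied values
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman rho = Pearson correlation of the rank vectors."""
    ra, rb = rankdata(a), rankdata(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

# Hypothetical per-artwork scores: MLLM ratings vs. human preference.
mllm_scores = [0.9, 0.4, 0.7, 0.2, 0.6]
human_scores = [0.8, 0.3, 0.9, 0.1, 0.5]
print(round(spearman(mllm_scores, human_scores), 3))  # -> 0.9
```

A rho near 1 would indicate the model ranks artworks much as humans do; the paper's finding is that structured prompting raises this alignment relative to naive prompting.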