Can MLLMs Understand the Deep Implication Behind Chinese Images?

Chenhao Zhang, Xi Feng, Yuelin Bai, Xinrun Du, Jinchang Hou, Kaixin Deng, Guangzeng Han, Qinrui Li, Bingli Wang, Jiaheng Liu, Xingwei Qu, Yifei Zhang, Qixuan Zhao, Yiming Liang, Ziqiang Liu, Feiteng Fang, Min Yang, Wenhao Huang, Chenghua Lin, Ge Zhang, Shiwen Ni

2024-10-18

Summary

This paper introduces the Chinese Image Implication understanding Benchmark (CII-Bench), which evaluates how well Multimodal Large Language Models (MLLMs) can understand complex Chinese images and their deeper meanings.

What's the problem?

As MLLMs improve, there is a growing need to assess their ability to understand not just basic visual content but also the deeper cultural and contextual implications behind images, especially those related to Chinese culture. However, there has been little research focused on this area, leading to a gap in understanding how well these models can interpret Chinese visual content.

What's the solution?

To fill this gap, the authors created CII-Bench, a benchmark specifically designed to evaluate MLLMs on their understanding of Chinese images. The images in CII-Bench are sourced from the Chinese Internet and include traditional cultural representations, ensuring they reflect authentic Chinese contexts. The benchmark uses multiple-choice questions to test the models' comprehension of these images. The authors conducted experiments with various MLLMs and found that while the models performed reasonably well, they still lagged behind human accuracy (the best model reached 64.4%, versus an average human accuracy of 78.2%), particularly on traditional cultural images.

Why it matters?

This research is important because it helps improve how AI models understand cultural nuances and visual information. By developing CII-Bench, the authors provide a tool that can help advance MLLMs toward better comprehension of specific cultural contexts, which is crucial for applications in areas like education, media, and communication. Ultimately, this work contributes to the broader goal of achieving more advanced artificial general intelligence (AGI) that can understand and interact with diverse cultures effectively.

Abstract

As the capabilities of Multimodal Large Language Models (MLLMs) continue to improve, the need for higher-order capability evaluation of MLLMs is increasing. However, there is a lack of work evaluating MLLMs for higher-order perception and understanding of Chinese visual content. To fill the gap, we introduce the **C**hinese **I**mage **I**mplication understanding **Bench**mark, **CII-Bench**, which aims to assess the higher-order perception and understanding capabilities of MLLMs for Chinese images. CII-Bench stands out in several ways compared to existing benchmarks. Firstly, to ensure the authenticity of the Chinese context, images in CII-Bench are sourced from the Chinese Internet and manually reviewed, with corresponding answers also manually crafted. Additionally, CII-Bench incorporates images that represent Chinese traditional culture, such as famous Chinese traditional paintings, which can deeply reflect the model's understanding of Chinese traditional culture. Through extensive experiments on CII-Bench across multiple MLLMs, we have made significant findings. Initially, a substantial gap is observed between the performance of MLLMs and humans on CII-Bench. The highest accuracy of MLLMs attains 64.4%, whereas human accuracy averages 78.2%, peaking at an impressive 81.0%. Subsequently, MLLMs perform worse on Chinese traditional culture images, suggesting limitations in their ability to understand high-level semantics and a lack of deep knowledge of Chinese traditional culture. Finally, it is observed that most models exhibit enhanced accuracy when image emotion hints are incorporated into the prompts. We believe that CII-Bench will enable MLLMs to gain a better understanding of Chinese semantics and Chinese-specific images, advancing the journey towards expert artificial general intelligence (AGI). Our project is publicly available at https://cii-bench.github.io/.