Understanding MLLMs' Capabilities in Interpreting Chinese Image Implications
The paper "Can MLLMs Understand the Deep Implication Behind Chinese Images?" examines the capacity of Multimodal Large Language Models (MLLMs) to comprehend the deeper meanings embedded in Chinese visual content. It introduces a novel benchmark, the Chinese Image Implication understanding Benchmark (CII-Bench), designed to assess these models' higher-order perception, reasoning, and understanding of Chinese cultural contexts.
CII-Bench Characteristics
CII-Bench comprises 698 images from the Chinese Internet, carefully curated to include traditional Chinese paintings and modern cultural depictions such as memes and posters. This diversity ensures a comprehensive evaluation of MLLMs across various domains, including Life, Art, Society, Politics, Environment, and Chinese Traditional Culture. Each image is paired with multiple-choice questions, testing the model's ability to deduce deeper meanings and cultural implications.
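A benchmark item of this shape can be sketched as a simple record. The field names below are assumptions for illustration, not CII-Bench's actual schema:

```python
# Hypothetical sketch of one CII-Bench item; field names and values are
# assumptions for illustration, not the benchmark's published schema.
from dataclasses import dataclass
from typing import List

@dataclass
class CIIBenchItem:
    image_path: str     # path to the curated image
    domain: str         # e.g. "Life", "Art", "Chinese Traditional Culture"
    question: str       # multiple-choice question about the image's implication
    options: List[str]  # candidate answers
    answer_index: int   # index of the correct option

item = CIIBenchItem(
    image_path="images/0001.png",
    domain="Chinese Traditional Culture",
    question="What deeper implication does this painting convey?",
    options=["Praise of diligence", "Nostalgia", "Social satire", "Harmony with nature"],
    answer_index=3,
)
print(item.options[item.answer_index])
```

Pairing each image with discrete options keeps scoring objective, since a model's free-form interpretation can be reduced to a single letter choice.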
Experimental Evaluation
Experiments conducted across several MLLMs reveal notable findings:
- A discernible gap exists between human and model performance. Humans, on average, achieved 78.2% accuracy, while the best MLLM attained only 64.4%.
- MLLMs struggle markedly with images rooted in Chinese traditional culture, indicating a limited grasp of cultural nuances.
- Incorporating image emotion hints enhances model accuracy, suggesting difficulties in grasping emotional subtleties.
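The accuracy figures above follow from straightforward multiple-choice scoring, which can be sketched as below; the predictions and gold answers are illustrative, not the paper's data:

```python
# Minimal sketch of multiple-choice accuracy, the metric behind the
# 78.2% (human) vs. 64.4% (best MLLM) comparison. Data are illustrative.
def accuracy(predictions, answers):
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

preds = ["B", "C", "A", "D", "B"]
gold  = ["B", "C", "C", "D", "A"]
print(f"{accuracy(preds, gold):.1%}")  # → 60.0%
```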
Prompt Engineering Insights
The paper explores various prompts, including Chain-of-Thought (CoT) and domain-specific prompts. CoT did not significantly benefit most models and occasionally led to performance degradation. Meanwhile, providing emotion-related cues proved beneficial, reinforcing the necessity for emotion-aware processing in MLLMs.
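The prompt variants under comparison can be sketched as small template changes. The exact wording below is an assumption for illustration, not the prompts used in the paper:

```python
# Hedged sketch of three prompt variants discussed: baseline, Chain-of-Thought,
# and an emotion-hint prompt. Wording is assumed, not taken from the paper.
def build_prompt(question, options, variant="baseline", emotion=None):
    opts = "\n".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
    prompt = f"{question}\n{opts}\n"
    if variant == "cot":
        prompt += "Let's think step by step before answering.\n"
    elif variant == "emotion" and emotion:
        prompt += f"Hint: the image conveys a {emotion} emotion.\n"
    prompt += "Answer with the letter of the best option."
    return prompt

print(build_prompt("What does this meme imply?",
                   ["Optimism", "Irony"], variant="emotion", emotion="sarcastic"))
```

Because each variant only appends a short cue, their effects can be compared on identical questions, which is how the paper isolates the impact of CoT versus emotion hints.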
Evaluation of Chinese Traditional Paintings
A specialized evaluation metric assesses MLLMs' comprehension of Chinese traditional paintings across aspects such as Surface-level Information and Cultural History. GPT-4o scored 2.71 on this metric, illustrating that current models capture surface details but miss the deeper cultural connotations inherent in traditional Chinese artworks.
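One plausible way to read such a multi-aspect metric is as an average of per-aspect rubric scores. The sketch below is an assumption about the aggregation, and the aspect scores are invented for illustration:

```python
# Illustrative averaging of per-aspect rubric scores for traditional-painting
# evaluation. The aggregation rule and the numbers are assumptions, not the
# paper's actual scoring procedure.
def overall_score(aspect_scores):
    """Average the per-aspect scores into a single summary number."""
    return sum(aspect_scores.values()) / len(aspect_scores)

scores = {"Surface-level Information": 3.0, "Cultural History": 2.0}
print(round(overall_score(scores), 2))  # → 2.5
```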
Error Analysis
The paper's error analysis identifies prevalent model deficiencies: neglect of key information, over-inference, and superficial reasoning. Notably, models exhibit a fundamental misalignment with Chinese cultural contexts, largely attributed to a deficit in historical and cultural knowledge.
Implications and Future Directions
This research highlights significant challenges and opportunities for improving MLLMs in cross-cultural understanding and interpretation. The introduction of CII-Bench provides a platform for developing models that better understand intricate cultural nuances and advance towards expert AGI. Future work could focus on enriching cultural databases and enhancing multimodal fusion techniques.
In summary, the paper elucidates the complexities inherent in training models to seamlessly integrate and interpret visual content rich in cultural implications. It underscores the importance of developing benchmarks like CII-Bench to push the boundaries of current AI capabilities towards a more profound, culturally nuanced understanding.