
AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception (2404.09624v3)

Published 15 Apr 2024 in cs.CV

Abstract: The highly abstract nature of image aesthetics perception (IAP) poses a significant challenge for current multimodal LLMs (MLLMs). The lack of human-annotated multi-modality aesthetic data further exacerbates this dilemma, leaving MLLMs short of aesthetics perception capabilities. To address this challenge, we first introduce a comprehensively annotated Aesthetic Multi-Modality Instruction Tuning (AesMMIT) dataset, which serves as the cornerstone for building multi-modality aesthetics foundation models. Specifically, to align MLLMs with human aesthetics perception, we construct a corpus-rich aesthetic critique database with 21,904 diverse-sourced images and 88K items of human natural-language feedback, collected via progressive questions ranging from coarse-grained aesthetic grades to fine-grained aesthetic descriptions. To ensure that MLLMs can handle diverse queries, we further prompt GPT to refine the aesthetic critiques and assemble the large-scale aesthetic instruction-tuning dataset, i.e., AesMMIT, which consists of 409K multi-typed instructions to activate stronger aesthetic capabilities. Based on the AesMMIT database, we fine-tune open-source general foundation models, obtaining multi-modality Aesthetic Expert models, dubbed AesExpert. Extensive experiments demonstrate that the proposed AesExpert models deliver significantly better aesthetic perception performance than state-of-the-art MLLMs, including the most advanced GPT-4V and Gemini-Pro-Vision. Project homepage: https://yipoh.github.io/aes-expert/.

Overview of "AesExpert: Towards Multi-Modality Foundation Model for Image Aesthetics Perception"

The paper presents "AesExpert," a novel approach for enhancing the image aesthetics perception capabilities of multimodal LLMs (MLLMs). Recognizing the deficiency in human-annotated multimodal aesthetic data, the authors introduce a new dataset, Aesthetic Multi-Modality Instruction Tuning (AesMMIT), designed to bridge the gap between MLLMs and human aesthetic judgment.

Key Contributions

The authors make several notable contributions:

  1. Aesthetic Instruction-Following Dataset: AesMMIT is built on a corpus of aesthetic critiques collected through subjective experiments. It consists of 409K multi-type instructions derived from 21,904 diverse images and 88K items of human feedback, covering perception dimensions such as quality, attributes, emotion, and context reasoning (see the illustrative record sketch after this list).
  2. AesExpert Model: The paper fine-tunes existing MLLMs on the AesMMIT data, resulting in the AesExpert models. These models exhibit superior performance in aesthetics perception compared to contemporary MLLMs, including GPT-4V and Gemini-Pro-Vision.
  3. Open-source Contribution: The dataset and models, including their codes and checkpoints, are made publicly available, which could support further advancements in MLLMs with comprehensive aesthetic capabilities.
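
To make the dataset's structure concrete, the following is a minimal sketch of what a single AesMMIT-style instruction-tuning record could look like. The field names and example values are hypothetical illustrations, not the released schema.

```python
# Hypothetical illustration of one instruction-tuning record in an
# AesMMIT-style dataset; field names are assumptions, not the released schema.
record = {
    "image": "images/00321.jpg",                  # path to the source image
    "instruction_type": "aesthetic_description",  # e.g., grade / description / reasoning
    "instruction": (
        "Describe the aesthetic qualities of this photo, "
        "covering composition, color, and lighting."
    ),
    "response": (
        "The low-angle composition and warm golden-hour light give the "
        "scene a calm, inviting mood; the muted palette feels cohesive."
    ),
}

# A 409K-instruction dataset would then be a list of such records,
# typically serialized as JSON Lines for fine-tuning pipelines.
```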

Methodology

The methodology involves three main stages:

  • Collecting Human Feedback: 48 subjects provided detailed aesthetic feedback on images, ranging from coarse-grained aesthetic grades to fine-grained aesthetic descriptions and feelings.
  • GPT-Assisted Refinement: The authors prompted GPT to rewrite the human feedback into diverse instruction-following formats, broadening the dataset's coverage (a hedged sketch of this step follows the list).
  • Model Fine-tuning: Pre-existing open-source MLLMs were fine-tuned on the resulting dataset to produce the AesExpert models, targeting stronger aesthetic interaction capabilities.
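
The GPT-assisted refinement step can be pictured as prompting a chat model to rewrite each raw human critique into varied instruction-response pairs. The sketch below uses the OpenAI Python client as a stand-in; the prompt wording, model name, and output format are assumptions for illustration, not the authors' actual pipeline.

```python
import json

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def refine_critique(critique: str, model: str = "gpt-4o-mini") -> list[dict]:
    """Rewrite one raw human critique into several instruction-response pairs.

    The prompt and model here are illustrative assumptions; the paper's
    exact refinement prompts are not reproduced in this summary.
    """
    prompt = (
        "You are given a human aesthetic critique of an image. "
        "Rewrite it into 3 diverse instruction-response pairs (for example, "
        "a rating question, a descriptive question, and a reasoning "
        "question). Return only a JSON list of objects with 'instruction' "
        "and 'response' keys.\n\nCritique: " + critique
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    # A real pipeline should validate or repair the JSON before trusting it.
    return json.loads(resp.choices[0].message.content)


# Example with a hypothetical critique:
# samples = refine_critique("Pleasant warm tones, but the tilted horizon is distracting.")
```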

Results

Extensive evaluations on the AesBench benchmark demonstrate substantial improvements in aesthetic perception tasks. The models exhibit notably enhanced capabilities in aesthetic interpretation, perception, and empathy. Performance improvements, especially in assessing artificial intelligence-generated images, highlight the dataset’s efficacy.
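
Since aesthetic-perception benchmarks such as AesBench are largely multiple-choice, a model's score on these tasks reduces to accuracy over question-answer records. Below is a minimal, hypothetical scoring loop; the record fields and the `model_answer` callable are placeholders, not a released AesBench API.

```python
def accuracy(records, model_answer):
    """Score a model on multiple-choice benchmark records.

    records: iterable of dicts with 'image', 'question', 'options', 'answer';
    model_answer: callable (image, question, options) -> chosen option.
    Both signatures are hypothetical stand-ins for illustration.
    """
    correct = total = 0
    for r in records:
        pred = model_answer(r["image"], r["question"], r["options"])
        correct += int(pred == r["answer"])
        total += 1
    return correct / total if total else 0.0
```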

Implications and Future Directions

The work suggests significant potential for MLLMs in roles requiring nuanced aesthetic understanding, such as smart photography and media curation. Future work may expand the dataset to cover more diverse aesthetic contexts and further improve MLLMs' interpretative precision.

Overall, this research lays foundational work for developing models that replicate human-like aesthetic judgments, catalyzing progress in AI that interacts with visual domains more profoundly.

Authors (9)
  1. Yipo Huang
  2. Xiangfei Sheng
  3. Zhichao Yang
  4. Quan Yuan
  5. Zhichao Duan
  6. Pengfei Chen
  7. Leida Li
  8. Weisi Lin
  9. Guangming Shi