
AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception (2404.09624v3)

Published 15 Apr 2024 in cs.CV

Abstract: The highly abstract nature of image aesthetics perception (IAP) poses a significant challenge for current multimodal LLMs (MLLMs). The lack of human-annotated multi-modality aesthetic data further exacerbates this dilemma, leaving MLLMs short of aesthetics perception capabilities. To address this challenge, we first introduce a comprehensively annotated Aesthetic Multi-Modality Instruction Tuning (AesMMIT) dataset, which serves as the cornerstone for building multi-modality aesthetics foundation models. Specifically, to align MLLMs with human aesthetics perception, we construct a corpus-rich aesthetic critique database with 21,904 diverse-sourced images and 88K items of human natural-language feedback, collected via progressive questions ranging from coarse-grained aesthetic grades to fine-grained aesthetic descriptions. To ensure that MLLMs can handle diverse queries, we further prompt GPT to refine the aesthetic critiques and assemble the large-scale aesthetic instruction-tuning dataset, i.e., AesMMIT, which consists of 409K multi-typed instructions to activate stronger aesthetic capabilities. Based on the AesMMIT database, we fine-tune open-source general foundation models, obtaining multi-modality Aesthetic Expert models, dubbed AesExpert. Extensive experiments demonstrate that the proposed AesExpert models deliver significantly better aesthetic perception performance than state-of-the-art MLLMs, including the most advanced GPT-4V and Gemini-Pro-Vision. Project homepage: https://yipoh.github.io/aes-expert/.

Overview of "AesExpert: Towards Multi-Modality Foundation Model for Image Aesthetics Perception"

The paper presents "AesExpert," a novel approach for enhancing the image aesthetics perception capabilities of multimodal LLMs (MLLMs). Recognizing the deficiency in human-annotated multimodal aesthetic data, the authors introduce a new dataset, Aesthetic Multi-Modality Instruction Tuning (AesMMIT), designed to bridge the gap between MLLMs and human aesthetic judgment.

Key Contributions

The authors make several notable contributions:

  1. Aesthetic Instruction-Following Dataset: AesMMIT is built on a corpus of aesthetic critiques collected through subjective experiments. It consists of 409K multi-type instructions derived from 21,904 diverse images and 88K items of human feedback, covering perception dimensions such as quality, attributes, emotion, and context reasoning (see the illustrative record sketch after this list).
  2. AesExpert Model: The paper fine-tunes existing MLLMs on the AesMMIT data, resulting in the AesExpert models. These models exhibit superior performance in aesthetics perception compared to contemporary MLLMs, including GPT-4V and Gemini-Pro-Vision.
  3. Open-source Contribution: The dataset and models, including their codes and checkpoints, are made publicly available, which could support further advancements in MLLMs with comprehensive aesthetic capabilities.
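
To make the dataset's structure concrete, the following is a minimal sketch of what a single AesMMIT-style instruction-tuning record could look like. The field names and example values are hypothetical illustrations, not the released schema.

```python
# Hypothetical illustration of one instruction-tuning record in an
# AesMMIT-style dataset; field names are assumptions, not the released schema.
record = {
    "image": "images/00321.jpg",                  # path to the source image
    "instruction_type": "aesthetic_description",  # e.g., grade / description / reasoning
    "instruction": (
        "Describe the aesthetic qualities of this photo, "
        "covering composition, color, and lighting."
    ),
    "response": (
        "The low-angle composition and warm golden-hour light give the "
        "scene a calm, inviting mood; the muted palette feels cohesive."
    ),
}

# A 409K-instruction dataset would then be a list of such records,
# typically serialized as JSON Lines for fine-tuning pipelines.
```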

Methodology

The methodology involves three main stages:

  • Collecting Human Feedback: 48 subjects provided detailed aesthetic feedback on images, ranging from coarse-grained aesthetic grades to fine-grained aesthetic descriptions and feelings.
  • GPT-Assisted Refinement: The authors prompted GPT to rewrite the human feedback into diverse instruction-following formats, broadening the dataset's coverage (a hedged sketch of this step follows the list).
  • Model Fine-tuning: Pre-existing open-source MLLMs were fine-tuned on the resulting dataset to produce the AesExpert models, targeting stronger aesthetic interaction capabilities.
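
The GPT-assisted refinement step can be pictured as prompting a chat model to rewrite each raw human critique into varied instruction-response pairs. The sketch below uses the OpenAI Python client as a stand-in; the prompt wording, model name, and output format are assumptions for illustration, not the authors' actual pipeline.

```python
import json

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def refine_critique(critique: str, model: str = "gpt-4o-mini") -> list[dict]:
    """Rewrite one raw human critique into several instruction-response pairs.

    The prompt and model here are illustrative assumptions; the paper's
    exact refinement prompts are not reproduced in this summary.
    """
    prompt = (
        "You are given a human aesthetic critique of an image. "
        "Rewrite it into 3 diverse instruction-response pairs (for example, "
        "a rating question, a descriptive question, and a reasoning "
        "question). Return only a JSON list of objects with 'instruction' "
        "and 'response' keys.\n\nCritique: " + critique
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    # A real pipeline should validate or repair the JSON before trusting it.
    return json.loads(resp.choices[0].message.content)


# Example with a hypothetical critique:
# samples = refine_critique("Pleasant warm tones, but the tilted horizon is distracting.")
```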

Results

Extensive evaluations on the AesBench benchmark demonstrate substantial improvements in aesthetic perception tasks. The models exhibit notably enhanced capabilities in aesthetic interpretation, perception, and empathy. Performance improvements, especially in assessing artificial intelligence-generated images, highlight the dataset’s efficacy.
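
Since aesthetic-perception benchmarks such as AesBench are largely multiple-choice, a model's score on these tasks reduces to accuracy over question-answer records. Below is a minimal, hypothetical scoring loop; the record fields and the `model_answer` callable are placeholders, not a released AesBench API.

```python
def accuracy(records, model_answer):
    """Score a model on multiple-choice benchmark records.

    records: iterable of dicts with 'image', 'question', 'options', 'answer';
    model_answer: callable (image, question, options) -> chosen option.
    Both signatures are hypothetical stand-ins for illustration.
    """
    correct = total = 0
    for r in records:
        pred = model_answer(r["image"], r["question"], r["options"])
        correct += int(pred == r["answer"])
        total += 1
    return correct / total if total else 0.0
```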

Implications and Future Directions

The work suggests significant potential for MLLMs in roles requiring nuanced aesthetic understanding, such as smart photography and media curation. Future work may expand the dataset to cover more diverse aesthetic contexts and further improve MLLMs' interpretative precision.

Overall, this research lays foundational work for developing models that replicate human-like aesthetic judgments, catalyzing progress in AI that interacts with visual domains more profoundly.

Authors (9)
  1. Yipo Huang
  2. Xiangfei Sheng
  3. Zhichao Yang
  4. Quan Yuan
  5. Zhichao Duan
  6. Pengfei Chen
  7. Leida Li
  8. Weisi Lin
  9. Guangming Shi