UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark (2404.09619v1)

Published 15 Apr 2024 in cs.CV and cs.AI

Abstract: As an alternative to expensive expert evaluation, Image Aesthetic Assessment (IAA) stands out as a crucial task in computer vision. However, traditional IAA methods are typically constrained to a single data source or task, restricting their universality and broader application. In this work, to better align with human aesthetics, we propose a Unified Multi-modal Image Aesthetic Assessment (UNIAA) framework, comprising a Multi-modal LLM (MLLM) named UNIAA-LLaVA and a comprehensive benchmark named UNIAA-Bench. We choose MLLMs with both visual perception and language ability for IAA and establish a low-cost paradigm for transforming existing datasets into unified, high-quality visual instruction tuning data, from which UNIAA-LLaVA is trained. To further evaluate the IAA capability of MLLMs, we construct UNIAA-Bench, which consists of three aesthetic levels: Perception, Description, and Assessment. Extensive experiments validate the effectiveness and rationality of UNIAA. UNIAA-LLaVA achieves competitive performance on all levels of UNIAA-Bench compared with existing MLLMs. Specifically, our model performs better than GPT-4V in aesthetic perception and even approaches junior-level human performance. We find that MLLMs have great potential in IAA, yet there remains plenty of room for further improvement. UNIAA-LLaVA and UNIAA-Bench will be released.
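
The abstract's "low-cost paradigm for transforming existing datasets into unified, high-quality visual instruction tuning data" can be pictured with a small conversion sketch. The snippet below is illustrative only: it assumes a generic IAA dataset of (image, mean opinion score) pairs and a LLaVA-style conversation schema; the field names, prompt wording, and score-to-text mapping are assumptions, not the paper's actual pipeline.

```python
import json

def score_to_level(score: float, max_score: float = 10.0) -> str:
    """Map a numeric aesthetic score onto a coarse text label (assumed mapping)."""
    ratio = score / max_score
    if ratio >= 0.8:
        return "excellent"
    elif ratio >= 0.6:
        return "good"
    elif ratio >= 0.4:
        return "fair"
    else:
        return "poor"

def to_instruction_record(image_path: str, score: float) -> dict:
    """Build one visual-instruction-tuning sample in a LLaVA-style conversation format."""
    return {
        "image": image_path,
        "conversations": [
            {"from": "human",
             "value": "<image>\nHow would you rate the aesthetic quality of this image?"},
            {"from": "gpt",
             "value": f"The aesthetic quality of this image is {score_to_level(score)}."},
        ],
    }

if __name__ == "__main__":
    # Hypothetical raw annotations: (image path, mean opinion score on a 1-10 scale).
    raw = [("images/0001.jpg", 7.4), ("images/0002.jpg", 3.1)]
    records = [to_instruction_record(path, mos) for path, mos in raw]
    print(json.dumps(records, indent=2))
```

The same template could, in principle, be varied per source dataset (scores, attributes, or comments) to produce the unified instruction data the abstract describes, though the exact prompts and label granularity used for UNIAA-LLaVA are not specified here.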

Authors (10)
  1. Zhaokun Zhou (22 papers)
  2. Qiulin Wang (12 papers)
  3. Bin Lin (33 papers)
  4. Yiwei Su (1 paper)
  5. Rui Chen (310 papers)
  6. Xin Tao (50 papers)
  7. Amin Zheng (4 papers)
  8. Li Yuan (141 papers)
  9. Pengfei Wan (86 papers)
  10. Di Zhang (230 papers)
Citations (4)