UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark (2404.09619v1)
Abstract: As an alternative to expensive expert evaluation, Image Aesthetic Assessment (IAA) stands out as a crucial task in computer vision. However, traditional IAA methods are typically constrained to a single data source or task, restricting their universality and broader application. In this work, to better align with human aesthetics, we propose a Unified Multi-modal Image Aesthetic Assessment (UNIAA) framework, comprising a Multi-modal Large Language Model (MLLM) named UNIAA-LLaVA and a comprehensive benchmark named UNIAA-Bench. We choose MLLMs, which combine visual perception with language ability, for IAA, and establish a low-cost paradigm for transforming existing IAA datasets into unified, high-quality visual instruction tuning data, from which UNIAA-LLaVA is trained. To further evaluate the IAA capability of MLLMs, we construct UNIAA-Bench, which covers three aesthetic levels: Perception, Description, and Assessment. Extensive experiments validate the effectiveness and rationality of UNIAA. UNIAA-LLaVA achieves competitive performance on all levels of UNIAA-Bench compared with existing MLLMs. Specifically, our model performs better than GPT-4V in aesthetic perception and even approaches junior-level human performance. We find that MLLMs have great potential for IAA, yet substantial room for improvement remains. UNIAA-LLaVA and UNIAA-Bench will be released.
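The "low-cost paradigm" in the abstract converts existing IAA annotations into visual instruction tuning data for an MLLM. The paper's exact conversion recipe is not reproduced here, so the following is a minimal illustrative sketch in Python: it assumes an AVA-style record (an image path plus a mean opinion score on a 1-10 scale) and a LLaVA-style conversation JSON as the target format. The prompt wording, the score-to-level mapping, and the field names are assumptions for illustration, not the authors' scheme.

```python
# Hypothetical sketch: turn an AVA-style (image, mean score) record into a
# LLaVA-style instruction-tuning sample. The question text and the mapping
# from numeric scores to level words are illustrative assumptions.
import json

# Upper bounds (exclusive) for each coarse, text-defined aesthetic level.
LEVELS = [(2.0, "bad"), (4.0, "poor"), (6.0, "fair"), (8.0, "good"), (10.1, "excellent")]

def score_to_level(score: float) -> str:
    """Map a 1-10 mean opinion score to a coarse text-defined level."""
    for upper, word in LEVELS:
        if score < upper:
            return word
    return "excellent"

def to_instruction_sample(image_path: str, mean_score: float) -> dict:
    """Build one LLaVA-style conversation from a scored image."""
    return {
        "image": image_path,
        "conversations": [
            {"from": "human",
             "value": "<image>\nHow would you rate the aesthetics of this image?"},
            {"from": "gpt",
             "value": f"The aesthetic quality of this image is {score_to_level(mean_score)}."},
        ],
    }

if __name__ == "__main__":
    # Example: a 7.3 mean score falls in the "good" bucket under this mapping.
    sample = to_instruction_sample("ava/123456.jpg", 7.3)
    print(json.dumps(sample, indent=2))
```

Datasets with attribute labels or aesthetic comments could be converted analogously by swapping in different question and answer templates.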
Authors: Zhaokun Zhou, Qiulin Wang, Bin Lin, Yiwei Su, Rui Chen, Xin Tao, Amin Zheng, Li Yuan, Pengfei Wan, Di Zhang