Multi-modal Learnable Queries for Image Aesthetics Assessment (2405.01326v1)

Published 2 May 2024 in cs.CV

Abstract: Image aesthetics assessment (IAA) is attracting wide interest with the prevalence of social media. The problem is challenging due to its subjective and ambiguous nature. Rather than extracting aesthetic features from the image alone, user comments associated with an image can provide complementary knowledge that is useful for IAA. Existing large-scale pre-trained models demonstrate strong capabilities in extracting high-quality, transferable visual and textual features, and learnable queries have been shown to be effective at extracting useful features from such pre-trained visual features. We therefore propose MMLQ, which utilizes multi-modal learnable queries to extract aesthetics-related features from multi-modal pre-trained features. Extensive experiments demonstrate that MMLQ achieves new state-of-the-art performance on multi-modal IAA, surpassing previous methods by 7.7% and 8.3% in terms of SRCC and PLCC, respectively.
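The following PyTorch sketch illustrates the general mechanism the abstract describes: a small set of learnable query vectors cross-attends to frozen pre-trained visual and textual features (e.g., image patch tokens and comment token embeddings), and the pooled result is regressed to an aesthetic score. All module names, dimensions, and the additive fusion scheme are illustrative assumptions, not the authors' exact MMLQ architecture.

import torch
import torch.nn as nn

class MultiModalLearnableQueries(nn.Module):
    """Minimal sketch of multi-modal learnable queries for IAA.

    Hypothetical design: learnable queries cross-attend to frozen
    pre-trained visual and textual features; pooled query outputs are
    regressed to a scalar aesthetic score.
    """

    def __init__(self, num_queries=32, dim=768, num_heads=8):
        super().__init__()
        # Learnable query vectors shared across all inputs.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        # One cross-attention block per modality.
        self.vis_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Regression head mapping pooled query features to a score.
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, vis_feats, txt_feats):
        # vis_feats: (B, N_v, dim) frozen visual features
        # txt_feats: (B, N_t, dim) frozen textual features
        q = self.queries.unsqueeze(0).expand(vis_feats.size(0), -1, -1)
        v_out, _ = self.vis_attn(q, vis_feats, vis_feats)  # queries attend to image tokens
        t_out, _ = self.txt_attn(q, txt_feats, txt_feats)  # queries attend to comment tokens
        fused = self.norm(v_out + t_out)                   # simple additive fusion (assumption)
        return self.head(fused.mean(dim=1)).squeeze(-1)    # (B,) predicted aesthetic scores

# Example with dummy frozen features (batch of 4):
# model = MultiModalLearnableQueries()
# scores = model(torch.randn(4, 196, 768), torch.randn(4, 64, 768))

The usual appeal of this design is that the pre-trained encoders stay frozen: only the queries, cross-attention blocks, and regression head are trained, which keeps the trainable parameter count small while still exploiting strong transferable features.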

Authors (5)
  1. Zhiwei Xiong (83 papers)
  2. Yunfan Zhang (19 papers)
  3. Zhiqi Shen (62 papers)
  4. Peiran Ren (28 papers)
  5. Han Yu (218 papers)
Citations (1)