How is Visual Attention Influenced by Text Guidance? Database and Model (2404.07537v2)
Abstract: The analysis and prediction of visual attention have long been crucial tasks in computer vision and image processing. In practical applications, images are generally accompanied by various text descriptions; however, few studies have explored how such descriptions influence visual attention, let alone developed saliency prediction models that account for text guidance. In this paper, we conduct a comprehensive study of text-guided image saliency (TIS) from both subjective and objective perspectives. Specifically, we construct a TIS database named SJTU-TIS, which includes 1,200 text-image pairs and the corresponding eye-tracking data. Based on SJTU-TIS, we analyze how various text descriptions influence visual attention. Then, to facilitate the development of saliency prediction models that consider text influence, we establish a benchmark on SJTU-TIS using state-of-the-art saliency models. Finally, since most existing saliency models ignore the effect of text descriptions on visual attention, we further propose a text-guided saliency (TGSal) prediction model, which extracts and integrates both image and text features to predict image saliency under various text-description conditions. The proposed model significantly outperforms state-of-the-art saliency models on both the SJTU-TIS database and pure image saliency databases in terms of various evaluation metrics. The SJTU-TIS database and the code of the proposed TGSal model will be released at: https://github.com/IntMeGroup/TGSal.
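To make the idea of "extracting and integrating image and text features for saliency prediction" concrete, the following is a minimal PyTorch sketch of a text-guided saliency predictor. It is illustrative only and is not the authors' TGSal architecture: the small CNN image encoder, the GRU text encoder (standing in for a pretrained vision-language encoder such as CLIP or BLIP), the cross-attention fusion, and all layer sizes are assumptions introduced here for the example.

```python
# Minimal sketch of a text-guided saliency predictor (hypothetical, not TGSal).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGuidedSaliency(nn.Module):
    def __init__(self, vocab_size=10000, dim=256):
        super().__init__()
        # Image branch: a small CNN backbone producing a spatial feature map.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Text branch: token embedding + GRU, standing in for a pretrained
        # language or vision-language encoder used in practice.
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.text_encoder = nn.GRU(dim, dim, batch_first=True)
        # Fusion: image locations attend to text tokens via cross-attention.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Decoder: project fused features to a single-channel saliency map.
        self.decoder = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, image, text_tokens):
        feat = self.image_encoder(image)               # (B, C, H', W')
        b, c, h, w = feat.shape
        img_seq = feat.flatten(2).transpose(1, 2)      # (B, H'*W', C)
        txt_seq, _ = self.text_encoder(self.text_embed(text_tokens))  # (B, T, C)
        fused, _ = self.cross_attn(img_seq, txt_seq, txt_seq)         # (B, H'*W', C)
        fused = fused.transpose(1, 2).reshape(b, c, h, w)
        sal = self.decoder(fused)                      # (B, 1, H', W')
        # Upsample to input resolution and squash to [0, 1].
        sal = F.interpolate(sal, size=image.shape[-2:], mode="bilinear",
                            align_corners=False)
        return torch.sigmoid(sal)

# Usage: predict a saliency map for a 224x224 image and a tokenized description.
model = TextGuidedSaliency()
image = torch.randn(1, 3, 224, 224)
tokens = torch.randint(0, 10000, (1, 12))
saliency = model(image, tokens)   # (1, 1, 224, 224)
```

The key design point this sketch tries to convey is the fusion step: the same image can yield different saliency maps depending on the accompanying text, because the spatial image features are reweighted by attention over the text tokens before the saliency map is decoded.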