Predicting Visual Attention in Graphic Design Documents (2407.02439v1)
Abstract: We present a model for predicting visual attention during the free viewing of graphic design documents. While existing work on this topic has aimed at predicting the static saliency of graphic designs, ours is the first attempt to predict both spatial attention and the dynamic temporal order in which document regions are fixated by gaze, using a deep-learning-based model. We propose a two-stage model for predicting dynamic attention on such documents, with webpages being our primary choice of document design for demonstration. In the first stage, we predict saliency maps for each of the document components (e.g., logos, banners, and text for webpages), conditioned on the type of document layout. These component saliency maps are then jointly used to predict the overall document saliency. In the second stage, we use these layout-specific component saliency maps as the state representation for an inverse reinforcement learning model that predicts fixation scanpaths during document viewing. To test our model, we collected a new dataset consisting of eye movements from 41 people freely viewing 450 webpages (the largest dataset of its kind). Experimental results show that our model outperforms existing models in both saliency and scanpath prediction for webpages, and also generalizes well to other graphic design documents, such as comics, posters, and mobile UIs, as well as to natural images.
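To make the two-stage pipeline concrete, below is a minimal PyTorch sketch of the idea as described in the abstract. Everything here is an assumption for illustration: the component categories, layout count, layer shapes, and module names (`ComponentSaliencyNet`, `ScanpathPolicy`, `rollout_scanpath`) are hypothetical, and the greedy softmax rollout stands in for the paper's stage-2 inverse reinforcement learning model, which this sketch does not implement.

```python
# Minimal sketch of the two-stage idea, assuming PyTorch. All names, shapes,
# and the greedy rollout are illustrative placeholders, NOT the authors'
# architecture (their stage 2 is trained with inverse reinforcement learning).
import torch
import torch.nn as nn
import torch.nn.functional as F

N_COMPONENTS = 5  # e.g. logo, banner, text, image, other (assumed)
N_LAYOUTS = 4     # number of layout types to condition on (assumed)

class ComponentSaliencyNet(nn.Module):
    """Stage 1: layout-conditioned per-component saliency maps,
    jointly fused into an overall document saliency map."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        )
        self.layout_emb = nn.Embedding(N_LAYOUTS, 32)
        self.component_heads = nn.Conv2d(32, N_COMPONENTS, 1)  # one map per component
        self.fuse = nn.Conv2d(N_COMPONENTS, 1, 1)              # joint overall saliency

    def forward(self, image, layout_id):
        feats = self.backbone(image)                          # (B, 32, H, W)
        cond = self.layout_emb(layout_id)[:, :, None, None]   # (B, 32, 1, 1)
        comp_maps = torch.sigmoid(self.component_heads(feats * cond))
        overall = torch.sigmoid(self.fuse(comp_maps))
        return comp_maps, overall

class ScanpathPolicy(nn.Module):
    """Stage 2 (simplified): a policy over fixation locations whose state is
    the stack of component saliency maps plus a fixation-history map."""
    def __init__(self):
        super().__init__()
        self.score = nn.Conv2d(N_COMPONENTS + 1, 1, 3, padding=1)

    def forward(self, comp_maps, fixation_history):
        # fixation_history: (B, 1, H, W) map of already-fixated locations,
        # acting as a crude inhibition-of-return signal.
        state = torch.cat([comp_maps, fixation_history], dim=1)
        logits = self.score(state).flatten(1)                 # (B, H*W)
        return F.softmax(logits, dim=1)

def rollout_scanpath(stage1, policy, image, layout_id, n_fixations=6):
    """Greedy rollout: pick the highest-probability location at each step,
    then mark it as visited so the scanpath moves on."""
    comp_maps, _ = stage1(image, layout_id)
    B, _, H, W = comp_maps.shape
    history = torch.zeros(B, 1, H, W)
    path = []
    for _ in range(n_fixations):
        probs = policy(comp_maps, history)
        idx = probs.argmax(dim=1)                             # (B,)
        ys, xs = idx // W, idx % W
        path.append(torch.stack([ys, xs], dim=1))
        history[torch.arange(B), 0, ys, xs] = 1.0             # inhibition of return
    return torch.stack(path, dim=1)                           # (B, n_fixations, 2)

if __name__ == "__main__":
    stage1, policy = ComponentSaliencyNet(), ScanpathPolicy()
    img, layout = torch.rand(1, 3, 64, 64), torch.tensor([2])
    print(rollout_scanpath(stage1, policy, img, layout).shape)  # (1, 6, 2)
```

The key structural point the sketch preserves is the interface between the stages: stage 1's component saliency maps, not raw pixels, form the state that the scanpath model consumes, which is what lets the layout conditioning propagate into the predicted fixation order.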