Unifying Top-down and Bottom-up Scanpath Prediction Using Transformers (2303.09383v3)

Published 16 Mar 2023 in cs.CV and cs.AI

Abstract: Most models of visual attention aim at predicting either top-down or bottom-up control, as studied using different visual search and free-viewing tasks. In this paper we propose the Human Attention Transformer (HAT), a single model that predicts both forms of attention control. HAT uses a novel transformer-based architecture and a simplified foveated retina that collectively create a spatio-temporal awareness akin to the dynamic visual working memory of humans. HAT not only establishes a new state-of-the-art in predicting the scanpath of fixations made during target-present and target-absent visual search and "taskless" free viewing, but also makes human gaze behavior interpretable. Unlike previous methods that rely on a coarse grid of fixation cells and experience information loss due to fixation discretization, HAT features a sequential dense prediction architecture and outputs a dense heatmap for each fixation, thus avoiding discretizing fixations. HAT sets a new standard in computational attention, which emphasizes effectiveness, generality, and interpretability. HAT's demonstrated scope and applicability will likely inspire the development of new attention models that can better predict human behavior in various attention-demanding scenarios. Code is available at https://github.com/cvlab-stonybrook/HAT.
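
The abstract contrasts coarse fixation-cell grids with dense per-fixation heatmaps. The PyTorch sketch below is a minimal, hypothetical illustration of that design choice, not HAT's actual architecture (the real implementation is in the linked repository): a stand-in backbone, learned per-step queries, a small transformer decoder, and a dot-product head that produces one full-resolution probability map per fixation step. All module names, shapes, and hyperparameters here are assumptions for illustration only.

```python
# Minimal, hypothetical sketch of sequential dense-heatmap scanpath prediction.
# This is NOT the released HAT implementation (see the linked repository); it
# only illustrates emitting one dense fixation heatmap per scanpath step
# instead of classifying over a coarse grid of fixation cells.
import torch
import torch.nn as nn


class DenseScanpathSketch(nn.Module):
    def __init__(self, d_model: int = 64, n_steps: int = 6):
        super().__init__()
        self.n_steps = n_steps
        # A single strided conv stands in for a real image backbone.
        self.backbone = nn.Conv2d(3, d_model, kernel_size=8, stride=8)
        # One learned query per fixation step (an assumption of this sketch).
        self.queries = nn.Embedding(n_steps, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.upsample = nn.Upsample(scale_factor=8, mode="bilinear",
                                    align_corners=False)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W), with H and W divisible by 8
        feats = self.backbone(image)                      # (B, C, h, w)
        b, c, h, w = feats.shape
        memory = feats.flatten(2).transpose(1, 2)         # (B, h*w, C)
        queries = self.queries.weight.unsqueeze(0).expand(b, -1, -1)
        queries = self.decoder(queries, memory)           # (B, n_steps, C)
        # Per-location score for every step, then upsample to a dense map
        # rather than keeping a coarse grid of fixation cells.
        scores = torch.einsum("bsc,bnc->bsn", queries, memory)
        heatmaps = self.upsample(scores.view(b, self.n_steps, h, w))
        # Normalize each step's map into a spatial probability distribution.
        b, s, H, W = heatmaps.shape
        return heatmaps.flatten(2).softmax(dim=-1).view(b, s, H, W)


if __name__ == "__main__":
    model = DenseScanpathSketch()
    img = torch.randn(1, 3, 160, 256)
    maps = model(img)
    print(maps.shape)  # torch.Size([1, 6, 160, 256]): one dense map per fixation
```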
