Unifying Global-Local Representations in Salient Object Detection with Transformer (2108.02759v2)
Abstract: Fully convolutional networks (FCNs) have dominated salient object detection for a long period. However, the locality of CNNs requires models to be deep enough to obtain a global receptive field, and such depth leads to the loss of local details. In this paper, we introduce a new attention-based encoder, the vision transformer, into salient object detection to ensure that representations remain global from shallow to deep layers. Because it has a global view even in very shallow layers, the transformer encoder preserves more local representations for recovering spatial details in the final saliency maps. Moreover, since each layer captures a global view of its predecessor, adjacent layers implicitly maximize representation differences and minimize redundant features, so that every transformer layer's output contributes uniquely to the final prediction. To decode the transformer features, we propose a simple yet effective deeply-transformed decoder, which densely decodes and upsamples them to generate the final saliency map with less noise injection. Experimental results demonstrate that our method outperforms other FCN-based and transformer-based methods on five benchmarks by a large margin, with an average improvement of 12.17% in Mean Absolute Error (MAE). Code will be available at https://github.com/OliverRensu/GLSTR.
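As a rough illustration of the encoder-decoder idea sketched in the abstract, the snippet below pairs a ViT-style encoder (global self-attention at every layer) with a decoder that densely fuses the feature maps of all encoder layers before upsampling to a full-resolution saliency map. This is a minimal sketch under assumed hyperparameters; the class name `ViTSaliencyNet`, layer counts, and dimensions are illustrative and not the authors' GLSTR implementation.

```python
# Minimal sketch of a transformer encoder + dense decoder for saliency.
# All names, depths, and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ViTSaliencyNet(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=384, depth=4, heads=6):
        super().__init__()
        self.grid = img_size // patch                         # tokens per side
        self.embed = nn.Conv2d(3, dim, patch, stride=patch)   # patch embedding
        self.pos = nn.Parameter(torch.zeros(1, self.grid ** 2, dim))
        # Every encoder layer attends over all patch tokens, so even
        # shallow features have a full-image (global) receptive field.
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            for _ in range(depth)
        ])
        # Dense decoding: the decoder consumes the output of every encoder
        # layer, not only the last one, and fuses them before upsampling.
        self.fuse = nn.Conv2d(dim * depth, dim, 1)
        self.head = nn.Sequential(
            nn.Conv2d(dim, dim // 2, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim // 2, 1, 1),
        )

    def forward(self, x):
        b, _, h, w = x.shape
        tokens = self.embed(x).flatten(2).transpose(1, 2) + self.pos
        feats = []
        for layer in self.layers:
            tokens = layer(tokens)
            # reshape tokens back to a 2-D feature map for the decoder
            feats.append(tokens.transpose(1, 2).reshape(b, -1, self.grid, self.grid))
        fused = self.fuse(torch.cat(feats, dim=1))
        logits = self.head(fused)
        # upsample to the input resolution; sigmoid yields the saliency map
        return torch.sigmoid(
            F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)
        )


if __name__ == "__main__":
    model = ViTSaliencyNet()
    saliency = model(torch.randn(1, 3, 224, 224))
    print(saliency.shape)  # torch.Size([1, 1, 224, 224])
```

The key design choice mirrored here is that every encoder stage feeds the decoder directly, so local detail captured in shallow (yet already global) layers is not lost before the final upsampling.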