Calibration & Reconstruction: Deep Integrated Language for Referring Image Segmentation (2404.08281v1)
Abstract: Referring image segmentation aims to segment the object referred to by a natural language expression in an image. The primary challenge lies in efficiently propagating fine-grained semantic information from textual features to visual features. Many recent works use a Transformer to address this challenge; however, conventional transformer decoders can distort linguistic information in deeper layers, leading to suboptimal results. In this paper, we introduce CRFormer, a model that iteratively calibrates multi-modal features in the transformer decoder. We first generate language queries from vision features, emphasizing different aspects of the input language. We then propose a novel Calibration Decoder (CDec), in which the multi-modal features can be iteratively calibrated by the input language features: the output of each decoder layer is combined with the original language features to generate new queries for continuous calibration, gradually updating the language features. On top of CDec, we introduce a Language Reconstruction Module and a reconstruction loss, which use the queries from the final decoder layer to reconstruct the input language; this further prevents the linguistic information from being lost or distorted. Our experiments consistently show the superior performance of our approach on the RefCOCO, RefCOCO+, and G-Ref datasets compared to state-of-the-art methods.
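The iterative calibration idea in the abstract can be illustrated with a minimal sketch. This is not the paper's architecture: it uses single-head scaled dot-product attention in plain numpy, hypothetical function names (`calibration_decoder`, `reconstruction_loss`), and assumed feature shapes, purely to show the loop of regenerating queries from each layer's output together with the original language features, and a reconstruction loss against the input language.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    # single-head scaled dot-product attention (simplified, no projections)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def calibration_decoder(vision, language, num_layers=3):
    # initial language queries generated from vision features (assumption:
    # queries are vision tokens attending over the language features)
    queries = cross_attention(vision, language, language)
    for _ in range(num_layers):
        # decode: queries attend over the visual features
        feats = cross_attention(queries, vision, vision)
        # calibrate: regenerate queries from this layer's output
        # and the ORIGINAL language features, as the abstract describes
        queries = cross_attention(feats, language, language)
    return queries

def reconstruction_loss(queries, language):
    # reconstruct the input language from the final-layer queries
    # and penalize the deviation (MSE here as a stand-in loss)
    recon = cross_attention(language, queries, queries)
    return float(np.mean((recon - language) ** 2))

rng = np.random.default_rng(0)
vision = rng.standard_normal((16, 32))    # 16 visual tokens, dim 32
language = rng.standard_normal((8, 32))   # 8 word features, dim 32
q = calibration_decoder(vision, language)
loss = reconstruction_loss(q, language)
```

In the actual model, the attention layers would carry learned projections and the reconstruction loss would be trained jointly with the segmentation loss; the sketch only shows the data flow of continuous calibration.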
Authors: Yichen Yan, Xingjian He, Sihan Chen, Jing Liu