Calibration & Reconstruction: Deep Integrated Language for Referring Image Segmentation (2404.08281v1)

Published 12 Apr 2024 in cs.CV and cs.MM

Abstract: Referring image segmentation aims to segment the object referred to by a natural language expression from an image. The primary challenge lies in the efficient propagation of fine-grained semantic information from textual features to visual features. Many recent works utilize a Transformer to address this challenge. However, conventional transformer decoders can distort linguistic information with deeper layers, leading to suboptimal results. In this paper, we introduce CRFormer, a model that iteratively calibrates multi-modal features in the transformer decoder. We start by generating language queries using vision features, emphasizing different aspects of the input language. Then, we propose a novel Calibration Decoder (CDec) wherein the multi-modal features can be iteratively calibrated by the input language features. In the Calibration Decoder, we use the output of each decoder layer and the original language features to generate new queries for continuous calibration, which gradually updates the language features. Based on CDec, we introduce a Language Reconstruction Module and a reconstruction loss. This module leverages queries from the final layer of the decoder to reconstruct the input language and compute the reconstruction loss. This further prevents the language information from being lost or distorted. Our experiments consistently show the superior performance of our approach across the RefCOCO, RefCOCO+, and G-Ref datasets compared to state-of-the-art methods.
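The iterative calibration idea described in the abstract can be illustrated with a minimal sketch: at each decoder layer, new language queries are regenerated from the previous layer's output together with the original language features, so the linguistic signal is refreshed rather than progressively distorted. This is a hypothetical single-head, numpy-only simplification for intuition; the actual CDec architecture, layer count, and fusion details differ from what is shown here.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    # single-head scaled dot-product attention:
    # queries (Nq, d) attend over context (Nc, d)
    d = queries.shape[-1]
    attn = softmax(queries @ context.T / np.sqrt(d))
    return attn @ context

def calibration_decoder(vision_feats, lang_feats, num_layers=3):
    """Hypothetical sketch of iterative multi-modal calibration.

    Each layer fuses the current queries with vision features, then
    regenerates queries from that output and the ORIGINAL language
    features, so language information is re-injected at every layer.
    """
    # initial language queries generated from vision features
    queries = cross_attention(lang_feats, vision_feats)
    for _ in range(num_layers):
        fused = cross_attention(queries, vision_feats)
        # calibrate: mix layer output with attention back onto the
        # original (undistorted) language features
        queries = 0.5 * (fused + cross_attention(fused, lang_feats))
    return queries
```

A reconstruction loss in the spirit of the Language Reconstruction Module could then compare a projection of the final-layer queries against the input language features (e.g. a mean-squared or cross-entropy term), penalizing decoders that discard linguistic content.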

Authors (4)
  1. Yichen Yan (10 papers)
  2. Xingjian He (25 papers)
  3. Sihan Chen (39 papers)
  4. Jing Liu (525 papers)
Citations (1)