Interactive Image Segmentation with Cross-Modality Vision Transformers (2307.02280v1)
Abstract: Interactive image segmentation aims to segment a target from the background under manual guidance, taking as input multimodal data such as images, clicks, scribbles, and bounding boxes. Recently, vision transformers have achieved great success in several downstream visual tasks, and a few efforts have been made to bring this powerful architecture to the interactive segmentation task. However, previous works neglect the relations between the two modalities and directly mimic the way purely visual information is processed with self-attention. In this paper, we propose a simple yet effective network for click-based interactive segmentation with cross-modality vision transformers. Cross-modality transformers exploit mutual information between the modalities to better guide the learning process. Experiments on several benchmarks show that the proposed method achieves superior performance compared with previous state-of-the-art models. The stability of our method in terms of avoiding failure cases shows its potential as a practical annotation tool. The code and pretrained models will be released at https://github.com/lik1996/iCMFormer.
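To make the idea of attending across modalities concrete, the following is a minimal NumPy sketch of single-head cross-attention in which image tokens act as queries and click tokens supply keys and values. It is an illustration under stated assumptions, not the architecture proposed in the paper: the token counts, embedding size, random projection matrices, and the names `cross_attention`, `image_tokens`, and `click_tokens` are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    Queries come from one modality and keys/values from the other,
    unlike self-attention where all three come from the same tokens.
    """
    q = queries @ w_q                          # (n_queries, dim)
    k = context @ w_k                          # (n_context, dim)
    v = context @ w_v                          # (n_context, dim)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (n_queries, n_context)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights


# Hypothetical toy inputs: 16 image patch tokens and 16 click-map tokens.
rng = np.random.default_rng(0)
dim = 8
image_tokens = rng.normal(size=(16, dim))
click_tokens = rng.normal(size=(16, dim))
# Random projections stand in for learned weights (assumption).
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))

fused, weights = cross_attention(image_tokens, click_tokens, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Swapping the roles of `image_tokens` and `click_tokens` gives attention in the opposite direction; how the two directions are combined, and how many heads or layers are used, is a design choice this sketch does not address.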