Multi-Modal Prototypes for Open-World Semantic Segmentation (2307.02003v3)
Abstract: In semantic segmentation, generalizing a visual system to both seen and novel categories at inference time is practically valuable yet challenging. To enable this, existing methods mainly rely on either several support demonstrations from the visual side or informative clues from the textual side (e.g., the class names). However, both lines of work neglect the intrinsic complementarity of low-level visual and high-level language information, and explorations that treat the visual and textual modalities as a whole to improve predictions remain limited. To close this gap, we propose to encompass textual and visual clues as multi-modal prototypes that provide more comprehensive support for open-world semantic segmentation, and we build a novel prototype-based segmentation framework to realize this promise. Specifically, rather than straightforwardly combining bi-modal clues, we decompose the high-level language information into multi-aspect prototypes and aggregate the low-level visual information into more semantic prototypes; on this basis, a fine-grained complementary fusion makes the multi-modal prototypes more powerful and accurate for prediction. Based on an elastic mask prediction module that permits any number and form of prototype inputs, we are able to solve the zero-shot, few-shot, and generalized counterpart tasks in one architecture. Extensive experiments on both the PASCAL-$5^i$ and COCO-$20^i$ datasets show the consistent superiority of the proposed method over previous state-of-the-art approaches, and a range of ablation studies thoroughly dissects each component of our framework, both quantitatively and qualitatively, verifying their effectiveness.
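To make the idea of prototype-based prediction with a variable number of prototypes more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how language is decomposed into multi-aspect prototypes, how the fine-grained fusion works, or how the elastic mask prediction module is built. The sketch substitutes two simple, commonly used ingredients as assumptions: masked average pooling to obtain a visual prototype from a support image, and nearest-prototype labeling by cosine similarity, with the maximum taken over however many prototypes each class has. All function names and the toy embeddings are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def masked_average_pooling(features, mask):
    """Average support features (H, W, D) inside a binary mask (H, W) into one prototype (D,)."""
    weights = mask.astype(features.dtype)
    return (features * weights[..., None]).sum(axis=(0, 1)) / (weights.sum() + 1e-8)


def predict_mask(query_features, prototypes_per_class):
    """Label each pixel with the class whose best-matching prototype is most similar.

    query_features: (H, W, D) array of pixel embeddings.
    prototypes_per_class: list of (K_c, D) arrays; K_c may differ per class,
    so text-only (zero-shot) and text-plus-visual (few-shot) inputs both work.
    """
    q = l2_normalize(query_features)
    scores = []
    for protos in prototypes_per_class:
        p = l2_normalize(np.atleast_2d(protos))
        sim = q @ p.T                     # (H, W, K_c) cosine similarities
        scores.append(sim.max(axis=-1))   # best prototype per pixel for this class
    return np.stack(scores, axis=-1).argmax(axis=-1)


# Toy data (hypothetical): background pixels point along axis 0, foreground along axis 1.
H, W, D = 4, 6, 4
query = np.zeros((H, W, D))
query[:, :3, 0] = 1.0
query[:, 3:, 1] = 1.0

support = query.copy()
support_mask = np.zeros((H, W), dtype=int)
support_mask[:, 3:] = 1

# Visual prototypes aggregated from the support image.
visual_fg = masked_average_pooling(support, support_mask)
visual_bg = masked_average_pooling(support, 1 - support_mask)

# Stand-ins for text-derived prototypes; real ones would come from a text encoder.
text_bg = np.array([[1.0, 0.1, 0.0, 0.0]])
text_fg = np.array([[0.1, 0.9, 0.2, 0.0],
                    [0.0, 0.8, 0.0, 0.3]])

# Zero-shot setting: text prototypes only.
zero_shot = predict_mask(query, [text_bg, text_fg])

# Few-shot setting: text and visual prototypes stacked per class.
few_shot = predict_mask(
    query,
    [np.vstack([text_bg, visual_bg]), np.vstack([text_fg, visual_fg])],
)
```

Because `predict_mask` takes a list of per-class prototype arrays of arbitrary length, the same function serves both settings; this mirrors only the "any number of prototype inputs" property described in the abstract, not the actual module design.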