
Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation (2306.11087v1)

Published 19 Jun 2023 in cs.CV

Abstract: We study universal zero-shot segmentation in this work to achieve panoptic, instance, and semantic segmentation for novel categories without any training samples. Such zero-shot segmentation ability relies on inter-class relationships in semantic space to transfer the visual knowledge learned from seen categories to unseen ones. Thus, it is desired to well bridge semantic-visual spaces and apply the semantic relationships to visual feature learning. We introduce a generative model to synthesize features for unseen categories, which links semantic and visual spaces as well as addresses the issue of lack of unseen training data. Furthermore, to mitigate the domain gap between semantic and visual spaces, firstly, we enhance the vanilla generator with learned primitives, each of which contains fine-grained attributes related to categories, and synthesize unseen features by selectively assembling these primitives. Secondly, we propose to disentangle the visual feature into the semantic-related part and the semantic-unrelated part that contains useful visual classification clues but is less relevant to semantic representation. The inter-class relationships of semantic-related visual features are then required to be aligned with those in semantic space, thereby transferring semantic knowledge to visual feature learning. The proposed approach achieves impressively state-of-the-art performance on zero-shot panoptic segmentation, instance segmentation, and semantic segmentation. Code is available at https://henghuiding.github.io/PADing/.


Summary

  • The paper proposes PADing, a unified framework that integrates primitive generation with semantic-visual alignment to address zero-shot segmentation challenges.
  • It introduces a learnable primitive generator that synthesizes diverse features and employs feature disentanglement to bridge semantic and visual gaps.
  • Empirical results demonstrate significant improvements in standard metrics, highlighting enhanced performance across panoptic, instance, and semantic segmentation tasks.

Zero-shot learning (ZSL) has expanded markedly into image segmentation, where models must generalize to novel categories for which no training samples exist. This paper, "Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation," addresses panoptic, instance, and semantic segmentation of unseen classes. At its core, the paper proposes a framework that leverages cross-modal primitives and semantic-visual alignment to achieve state-of-the-art results across these segmentation tasks.

Framework Overview

The paper introduces a unified architecture, Primitive Generation with collaborative relationship Alignment and feature Disentanglement learning (PADing), designed to tackle zero-shot segmentation comprehensively. Its foundation is a generative model that synthesizes features for unseen categories, thereby bridging the semantic and visual spaces; this is critical because no training data exist for the novel categories.
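The generative idea can be sketched in a few lines: a generator maps a category's semantic embedding, concatenated with random noise for diversity, to a synthetic visual feature; once trained on seen classes, it can mint training features for unseen ones. The dimensions and the fixed random linear map below are illustrative assumptions, not the paper's actual (learned, deeper) generator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the paper's actual embedding sizes differ.
D_SEM, D_VIS = 300, 256

# Stand-in for the generator: a fixed random linear map from
# [semantic embedding; noise] to a visual feature. The real generator
# is a learned network trained so synthesized seen-class features
# match real ones.
W = rng.normal(scale=0.02, size=(2 * D_SEM, D_VIS))

def synthesize(sem_emb, n_samples=4):
    """Draw n_samples synthetic visual features for one category."""
    noise = rng.normal(size=(n_samples, D_SEM))          # diversity from noise
    sem = np.repeat(sem_emb[None, :], n_samples, axis=0)  # shared semantics
    return np.concatenate([sem, noise], axis=1) @ W

unseen_emb = rng.normal(size=D_SEM)  # e.g. a word embedding of the class name
feats = synthesize(unseen_emb)
print(feats.shape)  # (4, 256)
```

The synthesized features can then train a classifier or mask head that has never seen real examples of those categories.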

The method is distinguished by:

  1. Primitive Generator: This component employs a set of learnable primitives, each encapsulating fine-grained, category-related attribute information, to support robust synthetic feature generation for unseen categories. Unlike traditional models, which often employ direct mappings that suffer from the semantic-visual gap, this approach selectively assembles primitives to ensure feature diversity and integrity.
  2. Feature Disentanglement: The visual feature space is divided into a semantic-related component, intended for alignment with the semantic space, and a semantic-unrelated component, which encompasses visual classification clues. This distinction allows for a more nuanced representation that better correlates with the semantic embeddings.
  3. Semantic-Visual Relationship Alignment: Through alignment, the proposed model transfers inter-class relationships from the semantic to the visual space, effectively integrating the structural class relationships that semantic embeddings naturally reveal.
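The assembly and alignment steps above can be sketched schematically. Everything below is an illustrative assumption (bank size, projections, and the exact loss form); the paper trains these components end-to-end inside a segmentation network.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def l2norm(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hypothetical sizes: a bank of 16 primitives in a 64-d feature space.
N_PRIM, D = 16, 64
primitives = rng.normal(size=(N_PRIM, D))
W_q = rng.normal(scale=0.1, size=(D, D))  # query projection (learned in practice)
W_k = rng.normal(scale=0.1, size=(D, D))  # key projection (learned in practice)

def assemble(class_embs):
    """Synthesize one feature per class by attending over the primitive bank."""
    q = class_embs @ W_q                   # queries from class semantics
    k = primitives @ W_k                   # keys from primitives
    attn = softmax(q @ k.T / np.sqrt(D))   # selective assembly weights
    return attn @ primitives, attn

def alignment_loss(vis_feats, sem_embs):
    """Mean squared difference between the cosine-similarity matrices of the
    semantic-related visual features and of the semantic embeddings, so that
    inter-class structure transfers from the semantic to the visual space."""
    sv = l2norm(vis_feats) @ l2norm(vis_feats).T
    ss = l2norm(sem_embs) @ l2norm(sem_embs).T
    return float(np.mean((sv - ss) ** 2))

sems = rng.normal(size=(5, D))   # 5 hypothetical class embeddings
feats, attn = assemble(sems)     # each row of attn sums to 1
loss = alignment_loss(feats, sems)
```

Minimizing this alignment loss pulls the similarity structure of the (semantic-related) visual features toward that of the semantic embeddings, which is the mechanism by which class relationships transfer.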

Numerical Results and Implications

The paper's empirical findings underscore the effectiveness of PADing, showing considerable improvements on universal zero-shot segmentation tasks. It reports gains on standard datasets in panoptic quality (PQ), segmentation quality (SQ), and recognition quality (RQ) over existing methods. For instance, the primitive generator alone yields an appreciable increase in unseen-category accuracy, and combining it with alignment and disentanglement brings further gains.
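For reference, the panoptic metrics decompose as PQ = SQ × RQ. A minimal illustration of the per-class computation (the dataset-level score additionally averages over classes):

```python
def panoptic_quality(matched_ious, n_fp, n_fn):
    """Compute PQ, SQ, RQ for one class from the IoUs of matched
    prediction/ground-truth segment pairs (each > 0.5 under the standard
    matching rule), plus counts of unmatched predictions (FP) and
    unmatched ground-truth segments (FN)."""
    tp = len(matched_ious)
    if tp + n_fp + n_fn == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp if tp else 0.0   # mean IoU of matched pairs
    rq = tp / (tp + 0.5 * n_fp + 0.5 * n_fn)     # F1-style recognition score
    return sq * rq, sq, rq

pq, sq, rq = panoptic_quality([0.9, 0.8, 0.7], n_fp=1, n_fn=1)
print(round(pq, 3), round(sq, 3), round(rq, 3))  # 0.6 0.8 0.75
```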

The implications of these results are twofold:

  • Theoretical: The framework highlights the potential of integrating cross-modal generation with semantic guidance, providing a pathway for future research to explore component-based feature generation and domain alignment strategies.
  • Practical: On a practical front, the ability to perform segmentation on unknown categories without needing new labeled data expands the applicability of AI models to dynamic domains where such classes are continually encountered.

Speculations on Future Developments

This paper lays the groundwork for future exploration into feature synthesis and alignment in ZSL. Anticipated developments could include:

  • Enhanced primitive sets that automatically adapt to varying semantic complexities, thereby refining the generative process.
  • Techniques to further minimize the semantic-visual gap, potentially incorporating real-time adaptation mechanisms that refine synthesized features using minimal unseen class cues.
  • Expansion into multi-domain applications beyond standard segmentation, such as real-time video segmentation and dynamic scene analysis, where unseen objects may emerge unexpectedly.

In conclusion, "Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation" offers a robust approach to zero-shot segmentation, demonstrating that a framework combining feature synthesis with semantic relationship alignment can significantly improve a model's ability to generalize to unseen categories.