
AllSpark: Reborn Labeled Features from Unlabeled in Transformer for Semi-Supervised Semantic Segmentation (2403.01818v3)

Published 4 Mar 2024 in cs.CV and cs.AI

Abstract: Semi-supervised semantic segmentation (SSSS) has been proposed to alleviate the burden of time-consuming pixel-level manual labeling, which leverages limited labeled data along with larger amounts of unlabeled data. Current state-of-the-art methods train the labeled data with ground truths and unlabeled data with pseudo labels. However, the two training flows are separate, which allows labeled data to dominate the training process, resulting in low-quality pseudo labels and, consequently, sub-optimal results. To alleviate this issue, we present AllSpark, which reborns the labeled features from unlabeled ones with the channel-wise cross-attention mechanism. We further introduce a Semantic Memory along with a Channel Semantic Grouping strategy to ensure that unlabeled features adequately represent labeled features. The AllSpark shed new light on the architecture level designs of SSSS rather than framework level, which avoids increasingly complicated training pipeline designs. It can also be regarded as a flexible bottleneck module that can be seamlessly integrated into a general transformer-based segmentation model. The proposed AllSpark outperforms existing methods across all evaluation protocols on Pascal, Cityscapes and COCO benchmarks without bells-and-whistles. Code and model weights are available at: https://github.com/xmed-lab/AllSpark.


Summary

  • The paper introduces AllSpark, a transformer-based bottleneck module that reconstructs ("reborns") labeled features from unlabeled ones via channel-wise cross-attention for semi-supervised semantic segmentation.
  • It adds a first-in-first-out Semantic Memory and a Channel Semantic Grouping strategy so that the stored unlabeled features adequately represent the labeled features being reconstructed.
  • Experiments show consistent mIoU gains on PASCAL VOC 2012, Cityscapes, and COCO, with minimal changes to existing training pipelines.

Overview of "AllSpark: Reborn Labeled Features from Unlabeled in Transformer for Semi-Supervised Semantic Segmentation"

The paper "AllSpark: Reborn Labeled Features from Unlabeled in Transformer for Semi-Supervised Semantic Segmentation" addresses a notable challenge in the domain of semi-supervised semantic segmentation (SSSS). This challenge is the reliance on manual pixel-level labeling, which is not only labor-intensive but also a significant bottleneck in scaling segmentation models. Current state-of-the-art methods predominantly utilize pseudo labeling for unlabeled data, which segregates the training flows and causes the labeled data to dominate the training process. This practice often results in the generation of low-quality pseudo labels, leading to sub-optimal model performance.

Methodology

The authors introduce a novel approach termed "AllSpark," which integrates unlabeled features into the training flow of labeled features through a channel-wise cross-attention mechanism. This integration effectively "rebirths" the labeled features, allowing them to benefit from the more diverse and comprehensive perspectives offered by the unlabeled data. The key components of the proposed method, sketched in code after the list, are:

  1. Channel-Wise Cross-Attention Mechanism: This mechanism forms the core of the AllSpark module, leveraging contextual information from unlabeled data to reconstruct the labeled features and thus preventing the dominance of labeled data during training.
  2. Semantic Memory (S-Mem): To overcome the limitations posed by a single mini-batch of unlabeled data, the authors adopt a FIFO queue to store features of unlabeled data. This expands the available feature space and allows for a more robust reconstruction process.
  3. Channel-wise Semantic Grouping: This strategy keeps the semantic memory up to date by grouping feature channels according to their similarity with the class probability maps predicted for unlabeled images.
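
A minimal PyTorch-style sketch of the first two components, written only from the description above, follows. The class names, the single-head formulation, the flat token buffer, and the random sampling step are illustrative assumptions, not the authors' released implementation (the paper's S-Mem is organized via the Channel Semantic Grouping strategy rather than as a flat buffer).

```python
import torch
import torch.nn as nn


class ChannelWiseCrossAttention(nn.Module):
    """Queries come from labeled features; keys/values come from unlabeled
    features (or the semantic memory). The attention map is C x C, relating
    channels rather than spatial tokens."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, f_labeled: torch.Tensor, f_unlabeled: torch.Tensor) -> torch.Tensor:
        # f_labeled:   (B, N, C) flattened tokens from labeled crops
        # f_unlabeled: (B, N, C) flattened tokens from unlabeled crops / memory
        # (assumes both carry the same number of tokens N)
        q = self.q_proj(f_labeled)
        k = self.k_proj(f_unlabeled)
        v = self.v_proj(f_unlabeled)

        n_tokens = q.shape[1]
        # Channel-to-channel affinity between labeled queries and unlabeled keys: (B, C, C)
        attn = torch.softmax(q.transpose(1, 2) @ k / n_tokens ** 0.5, dim=-1)

        # Each labeled channel is re-expressed as a mixture of unlabeled channels.
        reborn = (attn @ v.transpose(1, 2)).transpose(1, 2)  # (B, N, C)
        return f_labeled + self.out_proj(reborn)              # residual connection


class SemanticMemory:
    """First-in-first-out buffer of unlabeled token features, so keys/values are
    not limited to the unlabeled crops of the current mini-batch. This is a
    simplified flat token buffer; the paper's S-Mem is organized semantically."""

    def __init__(self, capacity_tokens: int = 4096):
        self.capacity = capacity_tokens
        self.buffer = None  # (M, C) with M <= capacity_tokens

    def enqueue(self, f_unlabeled: torch.Tensor) -> None:
        tokens = f_unlabeled.detach().flatten(0, 1)  # (B*N, C)
        self.buffer = tokens if self.buffer is None else torch.cat([self.buffer, tokens])
        self.buffer = self.buffer[-self.capacity:]   # FIFO: oldest rows are dropped

    def sample(self, batch_size: int, n_tokens: int) -> torch.Tensor:
        # Draw a token set matching the query shape so the C x C affinity is defined.
        idx = torch.randint(0, self.buffer.shape[0], (batch_size, n_tokens))
        return self.buffer[idx]  # (B, N, C)
```

In a training step one would enqueue the unlabeled batch's encoder features and rebuild the labeled batch's features with keys/values drawn from the memory, e.g. `cca(f_labeled, memory.sample(B, N))`.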

Results

The AllSpark module delivers strong results on the PASCAL VOC 2012, Cityscapes, and COCO benchmarks, outperforming existing SSSS methods under every evaluation protocol without significant alterations to existing training pipelines. The reported mIoU gains illustrate the effectiveness of the approach:

  • On the original PASCAL VOC 2012 split with 1/8 labeled data, AllSpark attains an mIoU of 78.41%, compared with the previous best of 77.19%.
  • On the Cityscapes dataset with 1/8 labeled data, the model reached 79.24% mIoU.
  • On the challenging COCO dataset, AllSpark consistently achieved higher mIoU scores across all labeling ratios.

Implications and Future Directions

The AllSpark approach shifts away from the prevailing framework-level paradigm by making architecture-level modifications that capitalize on transformers' capacity to leverage large amounts of unlabeled data. This design could streamline the training of foundation models in sparse labeling scenarios.
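
To make the architecture-level vs. framework-level distinction concrete, the sketch below shows how a bottleneck module in the spirit of AllSpark could sit between a generic transformer encoder and a segmentation decoder. The wrapper class, its call signature, and the skip-at-inference behavior are assumptions for illustration, not the released code.

```python
import torch.nn as nn


class SegmentorWithBottleneck(nn.Module):
    """Hypothetical wrapper: encoder -> bottleneck -> decoder. Only the bottleneck
    changes; the surrounding supervised / pseudo-label training loop stays as-is."""

    def __init__(self, encoder: nn.Module, bottleneck: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder        # e.g. a SegFormer-style transformer backbone
        self.bottleneck = bottleneck  # e.g. the channel-wise cross-attention block above
        self.decoder = decoder

    def forward(self, x_labeled, x_unlabeled=None):
        f_labeled = self.encoder(x_labeled)
        if self.training and x_unlabeled is not None:
            f_unlabeled = self.encoder(x_unlabeled)
            # Rebuild labeled features from unlabeled ones before decoding.
            f_labeled = self.bottleneck(f_labeled, f_unlabeled)
        return self.decoder(f_labeled)
```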

Given the results, AllSpark not only sets a new benchmark in semi-supervised semantic segmentation but also invites further exploration of channel-wise attention mechanisms at different levels of segmentation models. Future work could adapt the methodology to other semi-supervised tasks in computer vision or improve its computational efficiency for broader industrial use. The paper also raises questions about optimizing the semantic memory configuration and exploring alternative memory bank strategies.