Framework-agnostic Semantically-aware Global Reasoning for Segmentation (2212.03338v2)
Abstract: Recent advances in pixel-level tasks (e.g. segmentation) illustrate the benefit of long-range interactions between aggregated region-based representations that can enhance local features. However, such aggregated representations, often in the form of attention, fail to model the underlying semantics of the scene (e.g. individual objects and, by extension, their interactions). In this work, we address the issue by proposing a component that learns to project image features into latent representations and reason between them using a transformer encoder to generate contextualized and scene-consistent representations, which are fused with the original image features. Our design encourages the latent regions to represent semantic concepts by ensuring that the activated regions are spatially disjoint and the union of such regions corresponds to a connected object segment. The proposed semantic global reasoning (SGR) component is end-to-end trainable and can be easily added to a wide variety of backbones (CNN or transformer-based) and segmentation heads (per-pixel or mask classification) to consistently improve the segmentation results on different datasets. In addition, our latent tokens are semantically interpretable and diverse and provide a rich set of features that can be transferred to downstream tasks like object detection and segmentation, with improved performance. Furthermore, we also propose metrics to quantify the semantics of latent tokens at both the class and instance level.
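The project, reason, and fuse steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `sgr_sketch`, the learnable `queries`, the single-head self-attention standing in for the transformer encoder, and the softmax over tokens (used here as one simple way to push each pixel toward a single latent region) are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sgr_sketch(features, queries, w_q, w_k, w_v):
    """Hypothetical project -> reason -> fuse step.

    features: (N, D) flattened image features (N = H * W pixels).
    queries:  (K, D) assumed learnable latent queries, one per token.
    w_q, w_k, w_v: (D, D) assumed attention projection matrices.
    """
    d = features.shape[1]

    # Project: soft-assign every pixel to the K latent tokens.
    # The softmax runs over tokens, so each row sums to 1.
    assign = softmax(features @ queries.T / np.sqrt(d), axis=1)  # (N, K)

    # Each token is the assignment-weighted average of pixel features.
    weights = assign / (assign.sum(axis=0, keepdims=True) + 1e-8)
    tokens = weights.T @ features  # (K, D)

    # Reason: one single-head self-attention layer with a residual,
    # a stand-in for the transformer encoder mentioned in the abstract.
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(d), axis=1)  # (K, K)
    tokens = tokens + attn @ v

    # Fuse: broadcast contextualized tokens back to pixels and add
    # them to the original features.
    fused = features + assign @ tokens  # (N, D)
    return fused, assign


# Toy usage with random values; all sizes are arbitrary.
rng = np.random.default_rng(0)
n_pixels, dim, n_tokens = 12, 8, 4
features = rng.standard_normal((n_pixels, dim))
queries = rng.standard_normal((n_tokens, dim))
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) * 0.1 for _ in range(3))
fused, assign = sgr_sketch(features, queries, w_q, w_k, w_v)
```

In this sketch `fused` has the same shape as `features`, so the step could in principle sit between a backbone and a segmentation head without changing either. The disjointness and connectedness constraints on the latent regions are not modeled here.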