Benchmarking Segmentation Models with Mask-Preserved Attribute Editing (2403.01231v2)

Published 2 Mar 2024 in cs.CV

Abstract: When deploying segmentation models in practice, it is critical to evaluate their behavior in varied and complex scenes. Unlike previous evaluation paradigms, which consider only global attribute variations (e.g., adverse weather), we investigate both local and global attribute variations for robustness evaluation. To achieve this, we construct a mask-preserved attribute editing pipeline that edits the visual attributes of real images with precise control of structural information, so the original segmentation labels can be reused for the edited images. Using our pipeline, we construct a benchmark covering both object and image attributes (e.g., color, material, pattern, style). We evaluate a broad variety of semantic segmentation models, from conventional closed-set models to recent open-vocabulary large models, on their robustness to different types of variations. We find that both local and global attribute variations affect segmentation performance and that models' sensitivity diverges across variation types. We argue that local attributes are as important as global attributes and should be considered in the robustness evaluation of segmentation models. Code: https://github.com/PRIS-CV/Pascal-EA.

Authors (5)
  1. Zijin Yin (5 papers)
  2. Kongming Liang (29 papers)
  3. Bing Li (374 papers)
  4. Zhanyu Ma (103 papers)
  5. Jun Guo (130 papers)
Citations (1)

Summary

Overview

Benchmarking the robustness of segmentation models is pivotal, especially when evaluating their resilience to attribute variations. This paper introduces a mask-preserved attribute editing pipeline that assesses robustness under both local and global attribute variations. Using a pre-trained diffusion model guided by text instructions, the pipeline precisely edits visual attributes of real images while preserving their structural information. Because the structure is unchanged, the original segmentation labels can be reused for the edited images, avoiding costly re-annotation.
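
One way to realize such structure-preserving edits is to condition a diffusion model on the segmentation layout itself. The sketch below uses a publicly available segmentation-conditioned ControlNet for illustration; the checkpoints, prompt, and file names are assumptions, not the authors' released pipeline.

```python
# Illustrative sketch (not the authors' code): structure-preserving
# attribute editing with a segmentation-conditioned ControlNet.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Publicly available checkpoints, assumed here for illustration.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# The color-coded segmentation map fixes the scene layout, so the
# generated image stays spatially aligned with the original mask.
seg_map = load_image("seg_map.png")
edited = pipe(
    prompt="a photo of a wooden chair in a living room",  # a material edit
    image=seg_map,
    num_inference_steps=30,
).images[0]
edited.save("edited.png")
```

Because the layout is pinned by the conditioning map, the edited image can be scored against the unchanged ground-truth mask.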

Robustness Evaluation via Attribute Variations

The core challenge in evaluating segmentation model robustness has been the scarcity of high-quality, varied test data reflecting both local and global attribute changes. Traditional datasets and benchmarks address this need only partially: they focus mainly on global variations such as weather conditions and do not cover local attribute variations such as changes to an object's color, material, or pattern within the scene.

The proposed pipeline addresses these limitations by generating test images that span a wide range of attribute variations. The paper's experiments demonstrate that both local and global attribute changes significantly impact segmentation performance. Notably, models show varying sensitivities to different types of attribute changes, with object material variations causing the most pronounced performance declines. These findings underscore the role of object attributes in segmentation robustness and challenge the prevailing emphasis on global attributes alone. A minimal sketch of the evaluation protocol follows.
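
Because the edits preserve object masks, the same ground-truth labels score both the original and edited versions of each image. In the sketch below, `model`, the data lists, and the class count are placeholders, not the authors' released evaluation code.

```python
import numpy as np

def miou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean IoU over the classes that actually occur in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

def robustness_drop(model, originals, edited, labels, num_classes=21):
    """Performance gap between original and attribute-edited images,
    reusing the same ground-truth labels for both."""
    base = np.mean([miou(model(x), y, num_classes)
                    for x, y in zip(originals, labels)])
    shifted = np.mean([miou(model(x), y, num_classes)
                       for x, y in zip(edited, labels)])
    return base - shifted  # larger drop = less robust to this variation
```

Computing this drop separately per variation type (color, material, pattern, style) exposes which attributes a model is most sensitive to.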

Methodological Contributions

This research makes several notable contributions. Firstly, it introduces a mask-preserved attribute editing pipeline that enables the generation of images with varied attributes without the need for re-annotating segmentation labels. Secondly, it explores segmentation model robustness across a breadth of object and image attribute variations, providing a comprehensive assessment framework. Lastly, extensive experiments reveal that segmentation models exhibit varying degrees of sensitivity to different attribute variations, offering insights into model robustness and potential areas for improvement.

Future Directions

This work opens numerous avenues for further research. One promising direction is the refinement of the mask-preserved attribute editing technique to minimize spurious attribute changes during the editing process. Additionally, expanding the attribute set and conducting more granular analyses of model responses to specific attribute variations could yield deeper insights. Future work could also explore the integration of the proposed pipeline into the training process as a novel data augmentation strategy to enhance model robustness.
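
The augmentation idea floated above could look like the hypothetical sketch below: edited variants are randomly swapped in during training, each paired with the single shared segmentation mask. The directory layout and file naming are assumptions for illustration.

```python
import random
from pathlib import Path

import numpy as np
from PIL import Image
from torch.utils.data import Dataset

def load_pair(img_path, mask_path):
    """Load an image and its (unedited) segmentation mask as arrays."""
    img = np.asarray(Image.open(img_path).convert("RGB"))
    mask = np.asarray(Image.open(mask_path))
    return img, mask

class AttributeEditedSegDataset(Dataset):
    """Randomly serves an attribute-edited variant of each training image,
    always paired with the original image's segmentation mask."""

    def __init__(self, root: str, p_edit: float = 0.5):
        self.root = Path(root)
        self.items = sorted((self.root / "images").glob("*.jpg"))
        self.p_edit = p_edit

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        img_path = self.items[i]
        # Edited variants (e.g. color/material/pattern) are assumed to live
        # in a sibling folder keyed by the original image's stem.
        variants = list((self.root / "edited" / img_path.stem).glob("*.jpg"))
        if variants and random.random() < self.p_edit:
            img_path = random.choice(variants)
        mask_path = self.root / "masks" / f"{self.items[i].stem}.png"
        return load_pair(img_path, mask_path)
```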

Conclusion

In conclusion, this paper's mask-preserved attribute editing pipeline represents a significant advancement in benchmarking segmentation model robustness. By facilitating the generation of test images with a wide range of attribute variations while preserving structural integrity, this approach addresses a critical gap in the evaluation of segmentation models. The findings highlight the importance of object attributes in segmentation robustness and suggest new pathways for enhancing model performance.
