
Robustness Analysis on Foundational Segmentation Models (2306.09278v2)

Published 15 Jun 2023 in cs.CV

Abstract: Owing to the growth in computational resources and the accessibility of data, large deep learning models trained on copious amounts of multi-modal data using self-supervised or semi-supervised learning have emerged. These "foundation" models are often adapted to a variety of downstream tasks, such as classification, object detection, and segmentation, with little to no training on the target dataset. In this work, we perform a robustness analysis of Visual Foundation Models (VFMs) for segmentation tasks, focusing on robustness against perturbations inspired by real-world distribution shifts. We benchmark seven state-of-the-art segmentation architectures on two perturbed datasets, MS COCO-P and ADE20K-P, each covering 17 perturbations at 5 severity levels. Our findings reveal several key insights: (1) VFMs exhibit vulnerabilities to compression-induced corruptions; (2) although they do not outpace all unimodal models in robustness, multimodal models show competitive resilience in zero-shot scenarios; and (3) VFMs demonstrate enhanced robustness for certain object categories. These observations suggest that our robustness evaluation framework sets new requirements for foundation models and encourages further advancements to bolster their adaptability and performance. The code and dataset are available at: https://tinyurl.com/fm-robust
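
The perturbed benchmarks follow the ImageNet-C recipe: each clean validation image is corrupted by every perturbation type at each of the five severity levels. As a rough illustration of how such a split can be generated, the sketch below uses the open-source `imagecorruptions` package as a stand-in for the paper's 17 perturbations; the COCO paths and output layout are assumptions made for the example, not the authors' released pipeline.

```python
# Minimal sketch of building a perturbed validation split in the
# ImageNet-C style: every corruption applied to every image at
# severity levels 1-5, mirroring the MS COCO-P / ADE20K-P setup.
# Assumptions: the `imagecorruptions` package (pip install
# imagecorruptions) stands in for the paper's 17 perturbations,
# and the directory layout is illustrative only.
from pathlib import Path

import numpy as np
from PIL import Image
from imagecorruptions import corrupt, get_corruption_names

SRC_DIR = Path("coco/val2017")    # clean validation images (assumed path)
DST_DIR = Path("coco_p/val2017")  # root for the perturbed copies

for img_path in sorted(SRC_DIR.glob("*.jpg")):
    image = np.asarray(Image.open(img_path).convert("RGB"))
    for name in get_corruption_names():   # gaussian_noise, motion_blur, ...
        for severity in range(1, 6):      # five severity levels, as in the paper
            corrupted = corrupt(image, corruption_name=name, severity=severity)
            out = DST_DIR / name / f"severity_{severity}" / img_path.name
            out.parent.mkdir(parents=True, exist_ok=True)
            Image.fromarray(np.uint8(corrupted)).save(out)
```

A model under evaluation would then be run unchanged on each corruption/severity folder, and its score (e.g., mIoU) compared against the clean baseline to quantify the degradation per corruption type and severity.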

