Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning (2312.02546v2)

Published 5 Dec 2023 in cs.CV

Abstract: Although vision models such as Contrastive Language-Image Pre-Training (CLIP) show impressive generalization performance, their zero-shot robustness remains limited in Out-of-Distribution (OOD) scenarios without fine-tuning. Rather than resorting to costly human supervision, it is possible to exploit Multimodal Large Language Models (MLLMs), which possess powerful visual understanding abilities. However, MLLMs have been shown to struggle with vision problems due to task incompatibility, which hinders their utilization. In this paper, we propose to leverage MLLMs to conduct Machine Vision Therapy, which aims to rectify the noisy predictions of vision models. By fine-tuning on the denoised labels, the vision model's performance can be boosted in an unsupervised manner. To resolve the incompatibility issue, we propose a novel Denoising In-Context Learning (DICL) strategy that aligns vision tasks with MLLMs. Concretely, by estimating a transition matrix that captures the probability of one class being confused with another, we construct an instruction containing a correct exemplar and an erroneous one drawn from the most probable noisy class. Such an instruction helps any MLLM with ICL ability to detect and rectify incorrect predictions of vision models. Through extensive experiments on ImageNet, WILDS, DomainBed, and other OOD datasets, we validate the quantitative and qualitative effectiveness of our method. Our code is available at https://github.com/tmllab/Machine_Vision_Therapy.
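The abstract's DICL pipeline has two steps: estimate a class-confusion transition matrix from the vision model's own (unsupervised) predictions, then use it to build an in-context instruction pairing a correct exemplar with one from the most probable confused class. The sketch below illustrates one plausible way to do this in Python; the anchor-point heuristic, function names, and prompt template are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np


def estimate_transition_matrix(probs: np.ndarray, preds: np.ndarray,
                               num_classes: int) -> np.ndarray:
    """Estimate T[i, j] ~ P(predict j | class i) from soft predictions only.

    Heuristic (common in label-noise learning): for each class i, take the
    samples predicted as i with highest confidence as anchor points, and read
    off their average predicted distribution over all classes.
    """
    T = np.zeros((num_classes, num_classes))
    for i in range(num_classes):
        idx = np.where(preds == i)[0]
        if len(idx) == 0:
            # No predictions for this class: fall back to the identity row.
            T[i, i] = 1.0
            continue
        # Anchor points: the top ~10% most confident predictions for class i.
        conf = probs[idx, i]
        anchors = idx[np.argsort(conf)[-max(1, len(idx) // 10):]]
        T[i] = probs[anchors].mean(axis=0)
    return T / T.sum(axis=1, keepdims=True)


def build_dicl_prompt(T: np.ndarray, pred_class: int,
                      class_names: list, exemplars: list) -> str:
    """Build a denoising in-context instruction for an MLLM: one correct
    exemplar of the predicted class, one exemplar of the class it is most
    often confused with, then a verification query (template is hypothetical).
    """
    row = T[pred_class].copy()
    row[pred_class] = -1.0            # mask the diagonal (self)
    confusable = int(row.argmax())    # most probable noisy class
    return (
        f"Example 1: <image: {exemplars[pred_class]}> "
        f"This is a {class_names[pred_class]}.\n"
        f"Example 2: <image: {exemplars[confusable]}> "
        f"This is a {class_names[confusable]}.\n"
        f"Query: <image> The vision model predicts "
        f"'{class_names[pred_class]}'. Is the prediction correct, "
        f"or is this actually a '{class_names[confusable]}'?"
    )
```

Predictions the MLLM confirms or corrects via such prompts would then serve as denoised labels for fine-tuning the vision model, per the abstract.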
