Bridge the Modality and Capability Gaps in Vision-Language Model Selection (2403.13797v2)

Published 20 Mar 2024 in cs.LG and cs.CV

Abstract: Vision-Language Models (VLMs) excel in zero-shot image classification by pairing images with textual category names. The growing variety of pre-trained VLMs increases the likelihood of finding a suitable model for a specific task. To better reuse the VLM resource and fully leverage its potential on different zero-shot image classification tasks, a promising strategy is selecting appropriate pre-trained VLMs from the VLM Zoo, relying solely on the text data of the target dataset without access to the dataset's images. In this paper, we analyze two inherent challenges in assessing a VLM's ability in this Language-Only VLM Selection: the "Modality Gap" - the disparity between a VLM's embeddings of the two modalities, which makes text a less reliable substitute for images; and the "Capability Gap" - the discrepancy between a VLM's overall ranking and its ranking on the target dataset, which hinders direct prediction of a model's dataset-specific performance from its general performance. We propose VLM Selection With gAp Bridging (SWAB) to mitigate the negative impact of these two gaps. SWAB first uses optimal transport to capture the relevance between open-source and target datasets via a transportation matrix. It then uses this matrix to transfer useful statistics of VLMs from the open-source datasets to the target dataset, bridging both gaps. By bridging the two gaps to obtain better substitutes for test images, SWAB can accurately predict the performance ranking of different VLMs on the target task without needing the dataset's images. Experiments across various VLMs and image classification datasets validate SWAB's effectiveness.
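The transportation matrix at the core of SWAB can be illustrated with a minimal entropic optimal-transport (Sinkhorn) sketch. This is not the authors' implementation (the paper uses the POT library); it is a toy NumPy version under assumed inputs: the class embeddings, masses, and shapes below are hypothetical stand-ins for text embeddings of open-source and target class names.

```python
import numpy as np

def sinkhorn(a, b, C, reg=0.1, n_iter=500):
    """Entropic-regularized optimal transport via Sinkhorn iterations.

    Returns a transport matrix T whose row sums approximate a and whose
    column sums approximate b, trading off transport cost <T, C> against
    an entropy term weighted by reg.
    """
    K = np.exp(-C / reg)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)         # scale columns toward marginal b
        u = a / (K @ v)           # scale rows toward marginal a
    v = b / (K.T @ u)             # final column scaling
    return u[:, None] * K * v[None, :]

# Toy setup: 3 "open-source" classes vs. 2 "target" classes.
# Embeddings here are random placeholders for class-name text features.
rng = np.random.default_rng(0)
src = rng.normal(size=(3, 4))     # hypothetical source class embeddings
tgt = rng.normal(size=(2, 4))     # hypothetical target class embeddings
C = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)  # cost matrix

a = np.full(3, 1 / 3)             # uniform mass over source classes
b = np.full(2, 1 / 2)             # uniform mass over target classes
T = sinkhorn(a, b, C)

# T[i, j] measures the relevance of source class i to target class j;
# in SWAB's spirit, such weights can transfer per-class VLM statistics
# from open-source datasets to the target dataset.
print(T.round(3))
```

Each entry of the resulting matrix acts as a soft correspondence weight, so statistics known on open-source classes can be carried over to target classes in proportion to their estimated relevance.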

Authors: Chao Yi, De-Chuan Zhan, Han-Jia Ye, Yu-Hang He