
LOVM: Language-Only Vision Model Selection (2306.08893v1)

Published 15 Jun 2023 in cs.CV, cs.AI, and cs.LG

Abstract: Pre-trained multi-modal vision-language models (VLMs) are becoming increasingly popular due to their exceptional performance on downstream vision applications, particularly in the few- and zero-shot settings. However, selecting the best-performing VLM for some downstream applications is non-trivial, as it is dataset and task-dependent. Meanwhile, the exhaustive evaluation of all available VLMs on a novel application is not only time and computationally demanding but also necessitates the collection of a labeled dataset for evaluation. As the number of open-source VLM variants increases, there is a need for an efficient model selection strategy that does not require access to a curated evaluation dataset. This paper proposes a novel task and benchmark for efficiently evaluating VLMs' zero-shot performance on downstream applications without access to the downstream task dataset. Specifically, we introduce a new task LOVM: Language-Only Vision Model Selection, where methods are expected to perform both model selection and performance prediction based solely on a text description of the desired downstream application. We then introduce an extensive LOVM benchmark consisting of ground-truth evaluations of 35 pre-trained VLMs and 23 datasets, where methods are expected to rank the pre-trained VLMs and predict their zero-shot performance.
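The benchmark as described scores a LOVM method on two things: how well it ranks the candidate VLMs and how closely it predicts their zero-shot accuracy. A minimal sketch of that evaluation loop (not the paper's implementation; the score values below are hypothetical) might compare predicted and ground-truth accuracies per dataset using a rank correlation such as Kendall's tau plus a mean absolute error:

```python
# Illustrative sketch, not the paper's code: scoring a LOVM-style method
# on one dataset. `predicted` are hypothetical accuracies produced from a
# text description alone; `ground_truth` are hypothetical measured
# zero-shot accuracies for the same VLMs.

def kendall_tau(pred, true):
    """Kendall rank correlation between two equal-length score lists
    (O(n^2) pairwise version; assumes no exact ties)."""
    n = len(pred)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            a = pred[i] - pred[j]
            b = true[i] - true[j]
            if a * b > 0:
                concordant += 1
            elif a * b < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def mean_abs_error(pred, true):
    """Average absolute gap between predicted and measured accuracy."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

# Hypothetical top-1 accuracies for five candidate VLMs on one dataset.
predicted = [0.62, 0.58, 0.66, 0.45, 0.71]
ground_truth = [0.60, 0.55, 0.73, 0.50, 0.64]

print(kendall_tau(predicted, ground_truth))      # rank agreement in [-1, 1]
print(mean_abs_error(predicted, ground_truth))   # prediction error
```

In the benchmark this comparison would be repeated across all 23 datasets and aggregated, so a method is rewarded both for picking the right model (high rank correlation) and for calibrated performance estimates (low error).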

Authors (4)
  1. Orr Zohar (9 papers)
  2. Shih-Cheng Huang (17 papers)
  3. Kuan-Chieh Wang (30 papers)
  4. Serena Yeung (39 papers)
Citations (10)