VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations (2207.00221v2)
Abstract: Vision-Language Pretraining (VLP) models have recently facilitated many cross-modal downstream tasks. Most existing works evaluate their systems by comparing fine-tuned downstream task performance. However, average downstream task accuracy alone provides little information about the pros and cons of each VLP method, let alone insights into how the community can improve these systems in the future. Inspired by CheckList for testing natural language processing models, we propose VL-CheckList, a novel framework for understanding the capabilities of VLP models. The proposed method divides the image-text matching ability of a VLP model into three categories: objects, attributes, and relations, and uses a novel taxonomy to further break down each of these aspects. We conduct comprehensive studies of seven popular VLP models through the proposed framework. The results confirm the effectiveness of the proposed method by revealing fine-grained differences among the compared models that are not visible from downstream-task-only evaluation. Further results point to promising research directions for building better VLP models. Our data and code are available at: https://github.com/om-ai-lab/VL-CheckList.
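To make the evaluation idea concrete, below is a minimal sketch of a VL-CheckList-style probe, under the assumption that each test reduces to checking whether a model assigns a higher image-text matching score to the original caption than to a perturbed caption in which a single element (an object, attribute, or relation) has been swapped. CLIP loaded through the Hugging Face transformers API stands in here for a generic VLP model; the image path and captions are hypothetical placeholders, not the paper's actual test data.

```python
# Sketch of one VL-CheckList-style probe: does the model prefer the true
# caption over a caption with one swapped element? CLIP stands in for a
# generic VLP model; image path and captions are placeholders.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")             # hypothetical test image
positive = "a dog sitting on a red sofa"      # original caption
negative = "a cat sitting on a red sofa"      # object swapped: dog -> cat

inputs = processor(text=[positive, negative], images=image,
                   return_tensors="pt", padding=True)
scores = model(**inputs).logits_per_image[0]  # matching score per caption

# The model passes this object-level test if it scores the true caption
# higher than the perturbed one.
print("pass" if bool(scores[0] > scores[1]) else "fail")
```

Aggregating the pass rate over many such image-caption pairs, separately for object, attribute, and relation perturbations, yields roughly the kind of fine-grained per-category scores the framework reports.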
- TIDE: A general toolbox for identifying object detection errors. In European Conference on Computer Vision, 558–573. Springer.
- A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.
- Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3558–3568.
- UNITER: Universal image-text representation learning. In European Conference on Computer Vision, 104–120. Springer.
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations.
- An empirical study of training end-to-end vision-and-language transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18166–18176.
- Vision Checklist: Towards Testable Error Analysis of Image Models to Help System Designers Interrogate Model Capabilities. arXiv preprint arXiv:2201.11674.
- Decoupling the role of data, attention, and losses in multimodal transformers. Transactions of the Association for Computational Linguistics, 9: 570–585.
- Probing image-language transformers for verb understanding. arXiv preprint arXiv:2106.09141.
- MDETR: Modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1780–1790.
- Dynabench: Rethinking benchmarking in NLP. arXiv preprint arXiv:2104.14337.
- ViLT: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning, 5583–5594. PMLR.
- Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1): 32–73.
- Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34.
- VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv preprint arXiv:1908.03557.
- Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, 121–137. Springer.
- HAKE: Human Activity Knowledge Engine. arXiv preprint arXiv:1904.06539.
- RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems, 32.
- VisualSparta: An Embarrassingly Simple Approach to Large-scale Text-to-Image Search with Weighted Bag-of-words. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 5020–5029.
- Learning to Predict Visual Attributes in the Wild. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 13013–13023.
- Grounded Situation Recognition. arXiv preprint arXiv:2003.12058.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 8748–8763. PMLR.
- SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
- Beyond accuracy: Behavioral testing of NLP models with CheckList. arXiv preprint arXiv:2005.04118.
- Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, 618–626.
- Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2556–2565.
- ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension. arXiv preprint arXiv:2204.05991.
- LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490.
- Attention is all you need. Advances in Neural Information Processing Systems, 30.
- GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
- All in one: Exploring unified video-language pre-training. arXiv preprint arXiv:2203.07303.
- Vision-Language Pre-Training with Triple Contrastive Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 15671–15680.
- From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6720–6731.
- Visual Commonsense in Pretrained Unimodal and Multimodal Models. arXiv preprint arXiv:2205.01850.
- VinVL: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5579–5588.