Vision Language Transformers: A Survey (2307.03254v1)
Abstract: Vision language tasks, such as answering questions about an image or generating captions that describe it, are difficult for computers to perform. A relatively recent body of research has adapted the pretrained transformer architecture introduced in \citet{vaswani2017attention} to vision language modeling. Transformer models have greatly improved performance and versatility over previous vision language models. They do so by pretraining on large, generic datasets and transferring the learned representations to new tasks with only minor changes in architecture and parameter values. This type of transfer learning has become the standard modeling practice in both natural language processing and computer vision. Vision language transformers offer the promise of producing similar advances in tasks that require both vision and language. In this paper, we provide a broad synthesis of the currently available research on vision language transformer models and offer some analysis of their strengths, limitations, and remaining open questions.
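The pretrain-then-transfer recipe described in the abstract can be made concrete with a short sketch. The example below is illustrative only and is not taken from the surveyed paper: it assumes the Hugging Face `transformers` library and the publicly released `openai/clip-vit-base-patch32` checkpoint, freezes the pretrained vision language backbone, and adds a small task head for a hypothetical downstream image-text classification task.

```python
# Illustrative sketch (not from the paper): transfer a pretrained vision
# language transformer to a new task by freezing its weights and adding a
# small task-specific head. Assumes the Hugging Face `transformers` library
# and the public `openai/clip-vit-base-patch32` checkpoint.
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor


class ImageTextClassifier(nn.Module):
    """Hypothetical downstream classifier built on a frozen pretrained backbone."""

    def __init__(self, num_labels: int):
        super().__init__()
        self.backbone = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        for p in self.backbone.parameters():
            p.requires_grad = False  # reuse the pretrained weights unchanged
        dim = self.backbone.config.projection_dim
        # The "minor change in architecture": one linear layer over the
        # concatenated image and text embeddings.
        self.head = nn.Linear(2 * dim, num_labels)

    def forward(self, **inputs):
        out = self.backbone(**inputs)
        fused = torch.cat([out.image_embeds, out.text_embeds], dim=-1)
        return self.head(fused)


processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = ImageTextClassifier(num_labels=3)

image = Image.new("RGB", (224, 224))  # placeholder image for demonstration
inputs = processor(text=["is there a dog in the picture?"], images=image,
                   return_tensors="pt", padding=True)
logits = model(**inputs)  # during training, only `model.head` would be updated
```

Freezing the backbone and training only the head is one end of the spectrum; fully fine-tuning the pretrained parameters on the downstream task is the other common variant of the same transfer-learning pattern.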
- Nocaps: Novel object captioning at scale. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE.
- Flamingo: A visual language model for few-shot learning.
- VQA: Visual question answering. In 2015 IEEE International Conference on Computer Vision (ICCV). IEEE.
- Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254.
- Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. Advances in Neural Information Processing Systems, 35:32897–32912.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
- Multimodal pretraining unmasked: A meta-analysis and a unified framework of vision-and-language berts. Transactions of the Association for Computational Linguistics, 9:978–994.
- End-to-end object detection with transformers. In Computer Vision – ECCV 2020, Lecture notes in computer science, pages 213–229. Springer International Publishing, Cham.
- Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568.
- Valor: Vision-audio-language omni-perception pretraining model and dataset. arXiv preprint arXiv:2304.08345.
- Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794.
- Microsoft COCO captions: Data collection and evaluation server.
- UNITER: Learning UNiversal Image-TExt representations.
- Unifying vision-and-language tasks via text generation. In International Conference on Machine Learning, pages 1931–1942. PMLR.
- CoAtNet: Marrying convolution and attention for all data sizes.
- Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Prefix language models are unified modal learners. arXiv preprint arXiv:2206.07699.
- CSWin transformer: A general vision transformer backbone with cross-shaped windows.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- An empirical study of training end-to-end vision-and-language transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18166–18176.
- Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378.
- Large-scale adversarial training for vision-and-language representation learning. Advances in Neural Information Processing Systems, 33:6616–6628.
- Dynamic fusion with intra-and inter-modality attention flow for visual question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6639–6648.
- Ross Girshick. 2015. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448.
- Towards general purpose vision systems.
- Stevan Harnad. 1990. The symbol grounding problem. Physica D, 42(1):335–346.
- Deep residual learning for image recognition.
- Scaling up vision-language pretraining for image captioning. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.
- Seeing out of the box: End-to-end pre-training for vision-language representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12976–12985.
- Pixel-bert: Aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849.
- Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR.
- Spanbert: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77.
- Mdetr-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1780–1790.
- Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798.
- Vilt: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning, pages 5583–5594. PMLR.
- Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123(1):32–73.
- Taku Kudo and John Richardson. 2018. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.
- Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
- mplug: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005.
- Lavis: A library for language-vision intelligence.
- Unicoder-VL: A universal encoder for vision and language by Cross-Modal Pre-Training. AAAI, 34(07):11336–11344.
- Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.
- Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 34:9694–9705.
- Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.
- Muchen Li and Leonid Sigal. 2021. Referring transformer: A one-step approach to multi-task visual grounding.
- UNIMO: Towards Unified-Modal understanding and generation via Cross-Modal contrastive learning.
- Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pages 121–137. Springer.
- Microsoft COCO: Common objects in context. In Computer Vision – ECCV 2014, pages 740–755. Springer International Publishing.
- DQ-DETR: Dual query detection transformer for phrase extraction and grounding.
- Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
- KD-VLP: Improving end-to-end vision-and-language pretraining with object knowledge distillation. In Findings of the Association for Computational Linguistics: NAACL 2022, Stroudsburg, PA, USA. Association for Computational Linguistics.
- Swin transformer: Hierarchical vision transformer using shifted windows. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE.
- Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems.
- OpenAI. 2023. Gpt-4 technical report.
- Im2text: Describing images using 1 million captioned photographs. Advances in neural information processing systems, 24.
- Beit v2: Masked image modeling with vector-quantized visual tokenizers. arXiv preprint arXiv:2208.06366.
- Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.
- Improving language understanding by generative pre-training.
- Zero-shot text-to-image generation.
- You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788.
- Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252.
- Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
- Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114.
- Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
- Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565.
- Flava: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15638–15650.
- VL-BERT: Pre-training of generic Visual-Linguistic representations.
- A corpus for reasoning about natural language grounded in photographs. arXiv preprint arXiv:1811.00491.
- Hao Tan and Mohit Bansal. 2019. LXMERT: Learning Cross-Modality encoder representations from transformers.
- Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73.
- Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR.
- Attention is all you need. Advances in neural information processing systems, 30.
- GLUE: A Multi-Task benchmark and analysis platform for natural language understanding.
- Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100.
- Position-guided text prompt for vision-language pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23242–23251.
- Omnivl: One foundation model for image-language and video-language tasks. arXiv preprint arXiv:2209.07526.
- One-peace: Exploring one general representation model toward unlimited modalities. arXiv preprint arXiv:2305.11172.
- OFA: Unifying architectures, tasks, and modalities through a simple Sequence-to-Sequence learning framework.
- Image as a foreign language: Beit pretraining for vision and vision-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19175–19186.
- SimVLM: Simple visual language model pretraining with weak supervision.
- Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- E2e-vlp: end-to-end vision-language pre-training enhanced by visual learning. arXiv preprint arXiv:2106.01804.
- BridgeTower: Building bridges between encoders in Vision-Language representation learning.
- Unified contrastive learning in image-text-label space. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.
- UniTAB: Unifying text and box outputs for grounded Vision-Language modeling. In Computer Vision – ECCV 2022, pages 521–539. Springer Nature Switzerland.
- CoCa: Contrastive captioners are Image-Text foundation models.
- Mattnet: Modular attention network for referring expression comprehension. CoRR, abs/1801.08186.
- Florence: A new foundation model for computer vision.
- From recognition to cognition: Visual commonsense reasoning.
- X²-VLM: All-in-one pre-trained model for vision-language tasks.
- Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18123–18133.
- Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5579–5588.
- Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19–27.
Authors: Clayton Fields, Casey Kennington