Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language (2306.16410v1)
Abstract: We propose LENS, a modular approach for tackling computer vision problems by leveraging the power of LLMs. Our system uses an LLM to reason over the outputs of a set of independent and highly descriptive vision modules that provide exhaustive information about an image. We evaluate the approach on pure computer vision settings such as zero- and few-shot object recognition, as well as on vision-and-language problems. LENS can be applied to any off-the-shelf LLM, and we find that LLMs equipped with LENS perform highly competitively with much bigger and much more sophisticated systems, without any multimodal training whatsoever. We open-source our code at https://github.com/ContextualAI/lens and provide an interactive demo.
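To make the modular design concrete, here is a minimal sketch of a LENS-style pipeline: independent vision modules emit textual descriptions of an image, those descriptions are flattened into a single prompt, and a frozen LLM reasons over the prompt. The module and generation functions (`tag_image`, `describe_attributes`, `caption_image`, `llm_generate`) and the prompt template are hypothetical placeholders, not the authors' implementation; the actual code is in the linked repository.

```python
# Minimal sketch of a LENS-style pipeline (assumptions, not the authors' code):
# vision modules produce text, and a frozen LLM reasons over that text.
# The callables passed in stand for off-the-shelf models, e.g. a CLIP-based
# tagger, an attribute classifier, an image captioner, and any text-only LLM.

from typing import Callable, List


def build_prompt(tags: List[str], attributes: List[str],
                 captions: List[str], question: str) -> str:
    """Flatten the vision-module outputs into one textual prompt."""
    return (
        "Tags: " + ", ".join(tags) + "\n"
        "Attributes: " + ", ".join(attributes) + "\n"
        "Captions: " + " ".join(captions) + "\n"
        f"Question: {question}\n"
        "Short answer:"
    )


def lens_answer(image,
                question: str,
                tag_image: Callable,            # hypothetical: image -> list of tags
                describe_attributes: Callable,  # hypothetical: image -> list of attributes
                caption_image: Callable,        # hypothetical: image -> list of captions
                llm_generate: Callable) -> str: # hypothetical: prompt -> completion
    """Run the vision modules, assemble a prompt, and query a frozen LLM."""
    prompt = build_prompt(
        tags=tag_image(image),
        attributes=describe_attributes(image),
        captions=caption_image(image),
        question=question,
    )
    return llm_generate(prompt)
```

Because the LLM only ever sees text, swapping in a different language model or adding another vision module changes the prompt contents but requires no multimodal retraining, which is the core point of the approach.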