Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language (2306.16410v1)

Published 28 Jun 2023 in cs.CL and cs.CV

Abstract: We propose LENS, a modular approach for tackling computer vision problems by leveraging the power of LLMs. Our system uses an LLM to reason over outputs from a set of independent and highly descriptive vision modules that provide exhaustive information about an image. We evaluate the approach on pure computer vision settings such as zero- and few-shot object recognition, as well as on vision and language problems. LENS can be applied to any off-the-shelf LLM and we find that the LLMs with LENS perform highly competitively with much bigger and much more sophisticated systems, without any multimodal training whatsoever. We open-source our code at https://github.com/ContextualAI/lens and provide an interactive demo.


Summary

  • The paper introduces LENS, an approach that uses an LLM as a reasoning module over the outputs of standard vision components, performing visual tasks without any additional multimodal pretraining.
  • The methodology employs vision modules such as CLIP and BLIP to generate rich textual descriptions of an image, which the LLM then uses for object recognition and question answering.
  • Evaluations show that LENS is competitive in zero-shot and few-shot settings, surpassing several Flamingo variants on VQA 2.0 while lagging on knowledge-intensive benchmarks such as OK-VQA.

Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language

The paper presents a modular approach, termed LENS, that enables LLMs to tackle computer vision tasks through natural language descriptions. The authors propose using an LLM to interpret the outputs of independently operating vision modules, which provide comprehensive textual information about an image. Because the method requires no multimodal pretraining, it can be applied to any off-the-shelf LLM without additional multimodal datasets.

Overview and Methodology

The research introduces LENS, which treats the LLM as a reasoning module and pairs it with off-the-shelf vision components. This design eliminates the multimodal pretraining stage typically required to align visual and textual modalities in models such as Flamingo and Kosmos-1. LENS uses vision modules such as CLIP for tag and attribute recognition and BLIP for generating diverse image captions.

Critically, LENS capitalizes on the semantic reasoning capabilities already present in LLMs: the model reads the detailed textual descriptions produced by the vision modules and performs vision tasks such as object recognition and question answering without any extra vision-and-language alignment, as sketched below.
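
To make the pipeline concrete, the following is a minimal sketch of a LENS-style loop assembled from off-the-shelf Hugging Face models, with CLIP standing in for the tag/attribute module, BLIP for the caption module, and a frozen Flan-T5 as the reasoning LLM. The checkpoints, candidate tag vocabulary, and prompt template are illustrative assumptions rather than the authors' exact configuration; their reference implementation lives in the linked repository.

```python
# Hedged sketch of a LENS-style pipeline: frozen vision modules describe the
# image in text, and a frozen LLM reasons over that text. Checkpoints and the
# prompt format are placeholders, not the paper's exact setup.
import torch
from PIL import Image
from transformers import (
    CLIPModel, CLIPProcessor,
    BlipForConditionalGeneration, BlipProcessor,
    T5ForConditionalGeneration, T5Tokenizer,
)

device = "cuda" if torch.cuda.is_available() else "cpu"

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device)
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
blip = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-large").to(device)
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
llm = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl").to(device)
llm_tok = T5Tokenizer.from_pretrained("google/flan-t5-xl")

def top_tags(image, vocabulary, k=5):
    """Rank a candidate tag vocabulary against the image with CLIP."""
    k = min(k, len(vocabulary))
    inputs = clip_proc(text=vocabulary, images=image,
                       return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        probs = clip(**inputs).logits_per_image.softmax(dim=-1)[0]
    return [vocabulary[i] for i in probs.topk(k).indices.tolist()]

def captions(image, n=3):
    """Sample several diverse captions from BLIP."""
    inputs = blip_proc(images=image, return_tensors="pt").to(device)
    with torch.no_grad():
        out = blip.generate(**inputs, do_sample=True, top_p=0.9,
                            num_return_sequences=n, max_new_tokens=30)
    return [blip_proc.decode(o, skip_special_tokens=True) for o in out]

def answer(image, question, vocabulary):
    """Compose the vision modules' text outputs into a prompt for the frozen LLM."""
    prompt = (
        f"Tags: {', '.join(top_tags(image, vocabulary))}\n"
        f"Captions: {' | '.join(captions(image))}\n"
        f"Question: {question}\nShort answer:"
    )
    ids = llm_tok(prompt, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        out = llm.generate(ids, max_new_tokens=20)
    return llm_tok.decode(out[0], skip_special_tokens=True)

# Toy usage: the vocabulary is whatever candidate tag list you choose to supply.
image = Image.open("example.jpg")
print(answer(image, "What is the animal doing?", ["dog", "cat", "horse", "bird"]))
```

Because every component stays frozen, swapping in a different LLM or vision backbone only means changing the model identifiers, which is the property the paper exploits when pairing LENS with a range of off-the-shelf LLMs.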

Experimental Evaluation

The paper evaluates LENS across both pure computer vision and vision-language benchmarks. On zero-shot object recognition, LENS configurations deliver competitive or superior performance relative to standalone CLIP models; notably, the configuration pairing a ViT-H/14 visual backbone with Flan-T5-XXL as the LLM improves over the corresponding CLIP baseline. Few-shot experiments further show that increasing the number of shots and using larger vision encoders both improve performance.

On vision-language reasoning tasks, LENS holds up well against state-of-the-art models that rely on extensive multimodal pretraining. Specifically, LENS with Flan-T5-XXL surpasses several Flamingo variants on VQA 2.0, without the computational expense of training on large paired image-text datasets. On OK-VQA, however, LENS lags behind, possibly because its LLM carries a smaller knowledge base than the larger language models used in Flamingo.

Implications and Future Directions

The paper offers a streamlined way to add vision capabilities to LLMs without any additional training, and it highlights the potential of this modular design to extend existing and future LLMs with minimal overhead, pushing multimodal integration toward lower-cost approaches.

Future developments may involve extending this modular approach to other domains such as audio or video processing, enriching the cross-modal integration capabilities. Furthermore, optimizing the efficiency of integrating large LLMs with complex vision tasks remains an open area of exploration.

Conclusion

LENS represents a promising shift in how the vision and language domains are combined, delivering competitive results on computer vision and reasoning tasks without the cost of traditional multimodal pretraining. Its modularity and applicability across diverse LLMs mark a step toward more accessible and computationally efficient multimodal models, and they invite further exploration and innovation in this direction.
