A Vision Check-up for Language Models (2401.01862v1)

Published 3 Jan 2024 in cs.CV, cs.CL, and cs.LG

Abstract: What does learning to model relationships between strings teach LLMs about the visual world? We systematically evaluate LLMs' abilities to generate and recognize an assortment of visual concepts of increasing complexity and then demonstrate how a preliminary visual representation learning system can be trained using models of text. As LLMs lack the ability to consume or output visual information as pixels, we use code to represent images in our study. Although LLM-generated images do not look like natural images, results on image generation and the ability of models to correct these generated images indicate that precise modeling of strings can teach LLMs about numerous aspects of the visual world. Furthermore, experiments on self-supervised visual representation learning, utilizing images generated with text models, highlight the potential to train vision models capable of making semantic assessments of natural images using just LLMs.

Understanding LLMs' Visual Knowledge

Unveiling Visual Concepts in LLMs

The paper investigates what LLMs know about the visual world despite being trained purely on text. It shows that LLMs can generate code that renders visual concepts ranging from simple shapes and objects to complex scenes. Because models such as GPT-4 can neither consume nor output pixels, images are represented as programs: the model writes drawing code without ever interacting directly with visual data. Further analysis indicates that LLMs can compose intricate scenes, suggesting they capture spatial relationships among objects and other elements of the visual world.
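
To make the code-as-image idea concrete, here is a minimal sketch (not taken from the paper) of the kind of drawing program an LLM might emit when asked to depict a simple scene; the scene, color choices, and output filename are illustrative assumptions.

```python
# Hypothetical example of an LLM-emitted drawing program for
# "a house next to a tree": the image exists only as code,
# never as pixels the language model itself sees.
import matplotlib.pyplot as plt
import matplotlib.patches as patches

fig, ax = plt.subplots(figsize=(4, 4))

# House: body and roof
ax.add_patch(patches.Rectangle((0.2, 0.2), 0.3, 0.3, facecolor="peru"))
ax.add_patch(patches.Polygon([(0.2, 0.5), (0.35, 0.65), (0.5, 0.5)],
                             facecolor="firebrick"))

# Tree: trunk and canopy
ax.add_patch(patches.Rectangle((0.7, 0.2), 0.05, 0.2, facecolor="saddlebrown"))
ax.add_patch(patches.Circle((0.725, 0.5), 0.12, facecolor="forestgreen"))

ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_aspect("equal")
ax.axis("off")
plt.savefig("llm_scene.png", dpi=150)
```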

Experimenting with Artificial Vision

The authors prompt LLMs both to generate code corresponding to visual concepts and to recognize concepts from code. The concepts are collected into a structured benchmark, the Visual Aptitude Dataset. Experiments reveal that while models can generate rich scenes, their capabilities vary with complexity: they are strong at depicting objects and scenes described in text but weaker at rendering specific properties such as textures and precise shapes. The models also perform worse on recognition than on generation, pointing to a gap between their ability to create visual concepts and their ability to verify them.
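
One plausible way to score such generations automatically is to render each generated program to an image and compare it against its text prompt with an off-the-shelf CLIP model. The sketch below is an illustrative evaluation harness, not necessarily the paper's exact protocol; the filenames and the helper script llm_scene_program.py are hypothetical.

```python
# Hypothetical scoring loop: render an LLM-generated program to a PNG,
# then measure agreement between the render and its prompt with CLIP.
import subprocess
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_agreement(image_path: str, prompt: str) -> float:
    """Cosine similarity between the rendered image and its text prompt."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
        txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item()

# Run the (assumed) generated program, which saves "llm_scene.png",
# then score the render against the concept it was asked to draw.
subprocess.run(["python", "llm_scene_program.py"], check=True)
print(clip_agreement("llm_scene.png", "a house next to a tree"))
```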

Refining Image Generation through Textual Feedback

To probe whether this visual knowledge can be improved, the authors test the models' generative abilities under text-based iterative feedback: an LLM is shown its own image-rendering code and asked to improve it. Generation quality improves over successive rounds, indicating that LLMs hold a more dynamic and malleable understanding of visual concepts than a single-shot evaluation suggests; they can refine and correct their 'mental' images without ever seeing the rendered output.
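
A bare-bones version of such a refinement loop might look like the following. Here query_llm is a hypothetical placeholder for whatever chat-completion client is available, and the prompts are illustrative rather than the ones used by the authors.

```python
# Sketch of text-only iterative refinement: the model repeatedly sees
# its own drawing code and is asked to improve it, with no rendered
# image ever shown to it.
def query_llm(prompt: str) -> str:
    # Hypothetical stand-in; plug in your own LLM client here.
    raise NotImplementedError("connect an LLM client")

def refine_drawing_code(concept: str, rounds: int = 3) -> str:
    code = query_llm(
        f"Write Python matplotlib code that draws: {concept}. "
        "Return only the code."
    )
    for _ in range(rounds):
        code = query_llm(
            f"Here is code meant to draw '{concept}':\n\n{code}\n\n"
            "Improve the code so the drawing better matches the concept. "
            "Return only the improved code."
        )
    return code

final_code = refine_drawing_code("a cat sitting on a red couch")
```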

Advancing Vision Models with Text-Based Learning

Closing the Gap Between Text and Vision

A key result is the demonstration that images produced by LLMs can serve as training data for vision systems. Vision models pre-trained on these LLM-generated images perform reasonably well on tasks involving natural imagery, showing that knowledge acquired purely from text can cross the modality gap into visual understanding. This points toward a future in which text-trained models help train visual perception systems without direct exposure to natural visual data.
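
As a rough illustration of this downstream step, the sketch below runs a simplified SimCLR-style contrastive pretraining loop over a folder of images rendered from LLM-generated programs. The directory name rendered_llm_images/ and all hyperparameters are assumptions, and the authors' actual recipe, which builds on established self-supervised methods, will differ in detail.

```python
# Simplified self-supervised pretraining on rendered LLM images:
# two random augmentations of the same render form a positive pair.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])

class TwoViews:
    """Return two independent augmentations of the same rendered image."""
    def __init__(self, t): self.t = t
    def __call__(self, x): return self.t(x), self.t(x)

# Assumes renders are stored in class subfolders under rendered_llm_images/.
dataset = datasets.ImageFolder("rendered_llm_images/", transform=TwoViews(augment))
loader = DataLoader(dataset, batch_size=256, shuffle=True,
                    num_workers=8, drop_last=True)

encoder = models.resnet50(weights=None)
encoder.fc = torch.nn.Linear(encoder.fc.in_features, 128)  # projection head
optimizer = torch.optim.SGD(encoder.parameters(), lr=0.03,
                            momentum=0.9, weight_decay=1e-4)

def info_nce(z1, z2, tau=0.2):
    """One-sided InfoNCE loss; positives sit on the diagonal."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)

for (v1, v2), _ in loader:       # labels are ignored: fully self-supervised
    loss = info_nce(encoder(v1), encoder(v2))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```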

Summation and Implications

The paper highlights three main findings: LLMs can encapsulate and convey substantial visual information through text alone; they can refine their renderings of visual concepts using only conceptual, text-based feedback; and their outputs can help teach vision systems about the visual world without direct visual input. These results have clear implications for cross-modal AI and open the door to further research on how models trained purely on language represent and reinterpret the world around us.

Authors (8)
  1. Pratyusha Sharma (15 papers)
  2. Tamar Rott Shaham (14 papers)
  3. Manel Baradad (6 papers)
  4. Stephanie Fu (11 papers)
  5. Adrian Rodriguez-Munoz (5 papers)
  6. Shivam Duggal (9 papers)
  7. Phillip Isola (84 papers)
  8. Antonio Torralba (178 papers)