A Vision Check-up for Language Models (2401.01862v1)

Published 3 Jan 2024 in cs.CV, cs.CL, and cs.LG

Abstract: What does learning to model relationships between strings teach LLMs about the visual world? We systematically evaluate LLMs' abilities to generate and recognize an assortment of visual concepts of increasing complexity and then demonstrate how a preliminary visual representation learning system can be trained using models of text. As LLMs lack the ability to consume or output visual information as pixels, we use code to represent images in our study. Although LLM-generated images do not look like natural images, results on image generation and the ability of models to correct these generated images indicate that precise modeling of strings can teach LLMs about numerous aspects of the visual world. Furthermore, experiments on self-supervised visual representation learning, utilizing images generated with text models, highlight the potential to train vision models capable of making semantic assessments of natural images using just LLMs.

Understanding LLMs' Visual Knowledge

Unveiling Visual Concepts in LLMs

The paper investigates what LLMs know about the visual world despite being trained purely on text. It shows that LLMs can generate code that renders visual concepts ranging from simple shapes and objects to complex scenes. Because models such as GPT-4 can neither consume nor output pixels, images are represented as programs: the model writes drawing code without ever interacting directly with visual data. Further analysis indicates that LLMs can compose intricate scenes, suggesting they capture spatial relationships among objects and other elements of the visual world.
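
To make the code-as-image idea concrete, here is a minimal sketch (not taken from the paper) of the kind of drawing program an LLM might emit when asked to depict a simple scene; the scene, color choices, and output filename are illustrative assumptions.

```python
# Hypothetical example of an LLM-emitted drawing program for
# "a house next to a tree": the image exists only as code,
# never as pixels the language model itself sees.
import matplotlib.pyplot as plt
import matplotlib.patches as patches

fig, ax = plt.subplots(figsize=(4, 4))

# House: body and roof
ax.add_patch(patches.Rectangle((0.2, 0.2), 0.3, 0.3, facecolor="peru"))
ax.add_patch(patches.Polygon([(0.2, 0.5), (0.35, 0.65), (0.5, 0.5)],
                             facecolor="firebrick"))

# Tree: trunk and canopy
ax.add_patch(patches.Rectangle((0.7, 0.2), 0.05, 0.2, facecolor="saddlebrown"))
ax.add_patch(patches.Circle((0.725, 0.5), 0.12, facecolor="forestgreen"))

ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_aspect("equal")
ax.axis("off")
plt.savefig("llm_scene.png", dpi=150)
```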

Experimenting with Artificial Vision

The authors prompt LLMs both to generate code corresponding to visual concepts and to recognize concepts from code. The concepts are collected into a structured benchmark, the Visual Aptitude Dataset. Experiments reveal that while models can generate rich scenes, their capabilities vary with complexity: they are strong at depicting objects and scenes described in text but weaker at rendering specific properties such as textures and precise shapes. The models also perform worse on recognition than on generation, pointing to a gap between their ability to create visual concepts and their ability to verify them.
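
One plausible way to score such generations automatically is to render each generated program to an image and compare it against its text prompt with an off-the-shelf CLIP model. The sketch below is an illustrative evaluation harness, not necessarily the paper's exact protocol; the filenames and the helper script llm_scene_program.py are hypothetical.

```python
# Hypothetical scoring loop: render an LLM-generated program to a PNG,
# then measure agreement between the render and its prompt with CLIP.
import subprocess
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_agreement(image_path: str, prompt: str) -> float:
    """Cosine similarity between the rendered image and its text prompt."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
        txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item()

# Run the (assumed) generated program, which saves "llm_scene.png",
# then score the render against the concept it was asked to draw.
subprocess.run(["python", "llm_scene_program.py"], check=True)
print(clip_agreement("llm_scene.png", "a house next to a tree"))
```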

Refining Image Generation through Textual Feedback

To probe whether this visual knowledge can be improved, the authors test the models' generative abilities under text-based iterative feedback: an LLM is shown its own image-rendering code and asked to improve it. Generation quality improves over successive rounds, indicating that LLMs hold a more dynamic and malleable understanding of visual concepts than a single-shot evaluation suggests; they can refine and correct their 'mental' images without ever seeing the rendered output.
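
A bare-bones version of such a refinement loop might look like the following. Here query_llm is a hypothetical placeholder for whatever chat-completion client is available, and the prompts are illustrative rather than the ones used by the authors.

```python
# Sketch of text-only iterative refinement: the model repeatedly sees
# its own drawing code and is asked to improve it, with no rendered
# image ever shown to it.
def query_llm(prompt: str) -> str:
    # Hypothetical stand-in; plug in your own LLM client here.
    raise NotImplementedError("connect an LLM client")

def refine_drawing_code(concept: str, rounds: int = 3) -> str:
    code = query_llm(
        f"Write Python matplotlib code that draws: {concept}. "
        "Return only the code."
    )
    for _ in range(rounds):
        code = query_llm(
            f"Here is code meant to draw '{concept}':\n\n{code}\n\n"
            "Improve the code so the drawing better matches the concept. "
            "Return only the improved code."
        )
    return code

final_code = refine_drawing_code("a cat sitting on a red couch")
```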

Advancing Vision Models with Text-Based Learning

Closing the Gap Between Text and Vision

A key result is the demonstration that images produced by LLMs can serve as training data for vision systems. Vision models pre-trained on these LLM-generated images perform reasonably well on tasks involving natural imagery, showing that knowledge acquired purely from text can cross the modality gap into visual understanding. This points toward a future in which text-trained models help train visual perception systems without direct exposure to natural visual data.
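
As a rough illustration of this downstream step, the sketch below runs a simplified SimCLR-style contrastive pretraining loop over a folder of images rendered from LLM-generated programs. The directory name rendered_llm_images/ and all hyperparameters are assumptions, and the authors' actual recipe, which builds on established self-supervised methods, will differ in detail.

```python
# Simplified self-supervised pretraining on rendered LLM images:
# two random augmentations of the same render form a positive pair.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])

class TwoViews:
    """Return two independent augmentations of the same rendered image."""
    def __init__(self, t): self.t = t
    def __call__(self, x): return self.t(x), self.t(x)

# Assumes renders are stored in class subfolders under rendered_llm_images/.
dataset = datasets.ImageFolder("rendered_llm_images/", transform=TwoViews(augment))
loader = DataLoader(dataset, batch_size=256, shuffle=True,
                    num_workers=8, drop_last=True)

encoder = models.resnet50(weights=None)
encoder.fc = torch.nn.Linear(encoder.fc.in_features, 128)  # projection head
optimizer = torch.optim.SGD(encoder.parameters(), lr=0.03,
                            momentum=0.9, weight_decay=1e-4)

def info_nce(z1, z2, tau=0.2):
    """One-sided InfoNCE loss; positives sit on the diagonal."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)

for (v1, v2), _ in loader:       # labels are ignored: fully self-supervised
    loss = info_nce(encoder(v1), encoder(v2))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```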

Summation and Implications

The paper highlights three main findings: LLMs can encapsulate and convey substantial visual information through text alone; they can refine their renderings of visual concepts using only conceptual, text-based feedback; and their outputs can help teach vision systems about the visual world without direct visual input. These results have clear implications for cross-modal AI and open the door to further research on how models trained purely on language represent and reinterpret the world around us.

Authors (8)
  1. Pratyusha Sharma (15 papers)
  2. Tamar Rott Shaham (14 papers)
  3. Manel Baradad (6 papers)
  4. Stephanie Fu (11 papers)
  5. Adrian Rodriguez-Munoz (5 papers)
  6. Shivam Duggal (9 papers)
  7. Phillip Isola (84 papers)
  8. Antonio Torralba (178 papers)