Visual Classification via Description from LLMs
The paper "Visual Classification via Description from LLMs" by Sachit Menon and Carl Vondrick presents an innovative approach to visual recognition using Vision-LLMs (VLMs). This method diverges from the traditional zero-shot classification strategy employed by existing VLMs like CLIP, which determines category similarity solely through category names. Instead, it leverages the rich descriptive capabilities of LLMs such as GPT-3 to enhance interpretability and accuracy.
Methodology
The authors propose a "classification by description" framework: instead of comparing an image against bare category names, they query an LLM for descriptive attributes of each category, such as "a tiger's stripes and claws," and then use the VLM to check for those attributes in the image. This exploits the fine-grained world knowledge embedded in LLMs, which can generate detailed descriptors that would otherwise be laborious and inefficient to curate by hand.
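As a concrete illustration, the sketch below shows how such descriptors might be obtained from a text-completion LLM. The prompt wording is paraphrased from the paper's description of its GPT-3 query, and `llm_complete` is a hypothetical callable standing in for whatever LLM client is available.

```python
def descriptor_prompt(category: str) -> str:
    """Prompt in the spirit of the paper's GPT-3 query (wording paraphrased)."""
    return (
        f"Q: What are useful visual features for distinguishing a {category} in a photo?\n"
        f"A: There are several useful visual features to tell there is a {category} in a photo:\n-"
    )

def generate_descriptors(category: str, llm_complete) -> list[str]:
    """Query an LLM once per category and parse its bulleted answer.

    `llm_complete` is a hypothetical wrapper: it takes a prompt string and
    returns the raw completion text from any text-completion LLM (e.g. GPT-3).
    """
    completion = llm_complete(descriptor_prompt(category))
    # Expect one descriptor per line, bulleted with "-"; strip the bullets.
    return [line.lstrip("- ").strip()
            for line in completion.splitlines()
            if line.strip()]
```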
The method builds a set of descriptors D(c) for each category c and scores an image x by averaging the VLM similarity between x and each descriptor in D(c); the image is assigned to the category with the highest score. Notably, this approach requires no additional training and adds little computational overhead.
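A minimal sketch of this scoring, using OpenAI's public CLIP package as the VLM (an illustration of the idea rather than the authors' released code; the descriptor lists and image path below are hypothetical):

```python
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def category_score(image, descriptors):
    """Score a category as the mean image-descriptor similarity,
    following the classification-by-description idea."""
    image_input = preprocess(image).unsqueeze(0).to(device)
    text_inputs = clip.tokenize(descriptors).to(device)
    with torch.no_grad():
        img = model.encode_image(image_input)
        txt = model.encode_text(text_inputs)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    sims = (img @ txt.T).squeeze(0)          # one cosine similarity per descriptor
    return sims.mean().item(), sims

# Hypothetical descriptor sets; in practice these come from the LLM query above.
descriptors = {
    "tiger": ["a tiger, which has orange fur with black stripes",
              "a tiger, which has sharp claws",
              "a tiger, which has a long striped tail"],
    "tabby cat": ["a tabby cat, which has a small rounded face",
                  "a tabby cat, which has short fur with faint stripes",
                  "a tabby cat, which has whiskers"],
}
image = Image.open("example.jpg")            # hypothetical image path
scores = {c: category_score(image, d)[0] for c, d in descriptors.items()}
prediction = max(scores, key=scores.get)
```

Because each descriptor is scored independently and the scores are averaged, categories with different numbers of descriptors remain comparable, and no parameters are trained.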
Results
Experiments demonstrate the robustness of this classifier across datasets such as ImageNet, ImageNetV2, and several fine-grained and texture benchmarks. The framework improves top-1 accuracy on ImageNet over the standard category-name baseline, with even larger gains on distribution-shifted tasks such as the EuroSAT satellite-imagery dataset.
Moreover, the paper shows that the descriptors make model decisions inherently interpretable: inspecting which descriptors are most strongly activated reveals why the model made a given prediction. The same mechanism allows the model to be extended to novel categories unseen during the VLM's training simply by writing descriptors for them, as illustrated with the Ever Given container ship, which entered the news after CLIP's training data was collected.
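Continuing the sketch above, the per-descriptor similarities returned by `category_score` can be printed to see which attributes drove a prediction, and a novel category can be added just by appending a new descriptor list (again, an illustrative sketch rather than the paper's code):

```python
# Which descriptors were most strongly activated for the predicted class?
_, sims = category_score(image, descriptors[prediction])
for descriptor, sim in sorted(zip(descriptors[prediction], sims.tolist()),
                              key=lambda pair: pair[1], reverse=True):
    print(f"{sim:.3f}  {descriptor}")

# Recognizing a category unseen at training time only requires new descriptors.
descriptors["container ship"] = [
    "a container ship, which has a long flat hull",
    "a container ship, which has stacked shipping containers on deck",
]
```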
Implications and Future Directions
The implications of using LLMs for vision tasks are significant, suggesting that pre-trained LLMs can inject new adaptability and robustness into visual recognition systems. The paper also shows how biases can be mitigated by editing the descriptors to ensure a more balanced representation, evidence that the model's behavior can be reprogrammed through language alone.
Moving forward, this framework could catalyze progress on more nuanced and challenging classification tasks. The ability to distill domain-specific, even real-time, descriptors from LLMs opens broad applications, including greater transparency and control over model behavior, qualities essential for deploying AI in sensitive decision-making domains.
Building on these advances, future research could explore deeper multimodal integration and constrained optimization, refining the fusion of language and vision in AI systems and broadening what intelligent systems can achieve. Augmenting this approach with active learning strategies might also make learning more efficient in settings where data collection carries substantial cost or complexity.