Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment
The paper "Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment" addresses a central limitation of large language models (LLMs) such as GPT-3 and BERT: they lack visual perception, which restricts their use in domains requiring visual input, such as visual question answering (VQA) and robotics. The authors introduce the Language-Quantized AutoEncoder (LQAE), a method that departs from the traditional reliance on curated image-text datasets for multimodal learning.
Methodology
The core idea behind LQAE is to encode images as sequences of text tokens by directly quantizing image embeddings against the frozen token-embedding codebook of a pretrained language model (e.g., RoBERTa). The key steps, sketched in code below, are:
- Image Encoding: An image encoder maps each image to a sequence of embeddings, each of which is quantized to its nearest neighbor in the pretrained language model's token-embedding codebook, yielding a sequence of text tokens.
- Random Masking: The sequences undergo random masking, similar to the token masking in BERT.
- Prediction and Reconstruction: A BERT-style language model (RoBERTa) predicts the masked tokens, and an image decoder reconstructs the original image from these predictions.
This approach ensures that similar images produce similar clusters of text tokens, facilitating alignment between the visual and textual modalities without requiring paired text-image datasets.
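The quantization step is the crux of this pipeline. Below is a minimal PyTorch sketch of it, assuming a ViT-style encoder that outputs one embedding per image patch with the same dimensionality as RoBERTa's token embeddings; the function name and shapes are illustrative, not the authors' exact implementation.

```python
# Minimal sketch of LQAE's quantization step: image patch embeddings are snapped to
# their nearest RoBERTa token embeddings, which serve as a frozen codebook.
import torch
from transformers import RobertaModel

roberta = RobertaModel.from_pretrained("roberta-base")
codebook = roberta.embeddings.word_embeddings.weight.detach()  # (vocab_size, hidden_dim)

def quantize_to_text_tokens(patch_embeddings: torch.Tensor):
    """Map patch embeddings (batch, num_patches, hidden_dim) to the nearest RoBERTa
    token embeddings, returning token ids and quantized vectors."""
    flat = patch_embeddings.reshape(-1, patch_embeddings.shape[-1])
    # Squared L2 distance from every patch embedding to every codebook entry.
    dists = (
        flat.pow(2).sum(dim=1, keepdim=True)
        - 2 * flat @ codebook.T
        + codebook.pow(2).sum(dim=1)
    )
    token_ids = dists.argmin(dim=1)                            # (batch * num_patches,)
    quantized = codebook[token_ids].view_as(patch_embeddings)  # nearest token embeddings
    # Straight-through estimator: gradients flow to the encoder as if quantization were identity.
    quantized = patch_embeddings + (quantized - patch_embeddings).detach()
    return token_ids.view(patch_embeddings.shape[:-1]), quantized
```

The straight-through estimator in the last step lets reconstruction gradients reach the image encoder even though the nearest-neighbor lookup itself is non-differentiable.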
Experimental Evaluation
The experiments demonstrate that LQAE representations can be used for few-shot image classification with a frozen LLM and for linear classification of images based on intermediate language-model features. Several notable experiments and results are discussed:
- Linear Classification: On ImageNet, linear classifiers trained on intermediate RoBERTa representations of LQAE token sequences significantly outperformed those trained on traditional VQ-VAE encoder representations.
- Few-Shot Classification: LQAE was evaluated on Mini-ImageNet 2-way and 5-way few-shot tasks, performing competitively with or better than baselines that had access to large aligned text-image datasets. Notably, LQAE used no text-image pairs during training, which makes these results all the more striking (a sketch of the prompting setup follows below).
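To make the few-shot setup concrete, the sketch below shows one plausible way to assemble an in-context prompt from LQAE outputs: each support image is rendered as its decoded token string and paired with its class name, and the query image's tokens are appended for a frozen LLM (e.g., GPT-3) to complete. The prompt template and the tiny hand-written token ids are illustrative assumptions, not the paper's verbatim format.

```python
# Build a few-shot classification prompt from LQAE token ids (assumed format).
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

def build_fewshot_prompt(support, query_token_ids):
    """support: list of (lqae_token_ids, class_name); query_token_ids: LQAE ids for the query image."""
    lines = []
    for token_ids, class_name in support:
        # Decode the (generally non-human-readable) token string for each support image.
        lines.append(f"Input: {tokenizer.decode(token_ids)}\nLabel: {class_name}")
    lines.append(f"Input: {tokenizer.decode(query_token_ids)}\nLabel:")
    return "\n\n".join(lines)

# Hypothetical 2-way, 1-shot episode; the resulting string would be sent to a frozen
# LLM, which is expected to complete the final label.
prompt = build_fewshot_prompt(
    support=[([100, 1017, 20920], "dog"), ([7110, 30, 512], "cat")],
    query_token_ids=[100, 999, 20920],
)
print(prompt)
```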
Results and Findings
The results showcase several impactful findings:
- Unsupervised Text-Image Alignment: Achieving high accuracy in few-shot learning without aligned data underscores the potential of unsupervised learning approaches.
- Human Readability Not Essential: The generated text tokens need not be human-readable, yet they retain sufficient internal structure for LLMs to leverage effectively in few-shot learning.
- Masking Ratio: Experiments revealed that a higher masking ratio (around 50%) was necessary for optimal performance, deviating from the conventional 15% used for language denoisers like BERT.
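The masking step itself is standard BERT-style corruption with the ratio exposed as a hyperparameter. A minimal sketch is shown below, using RoBERTa's `<mask>` token id (50264) for concreteness; the function name is illustrative.

```python
# Randomly replace a fraction of LQAE tokens with the mask token; the paper reports
# that ratios around 50% work better for image-derived tokens than the usual 15%.
import torch

def random_mask(token_ids: torch.Tensor, mask_token_id: int, mask_ratio: float = 0.5):
    """Return the corrupted sequence and the boolean mask of corrupted positions,
    which is where the prediction/reconstruction loss is applied."""
    mask = torch.rand_like(token_ids, dtype=torch.float) < mask_ratio
    corrupted = token_ids.masked_fill(mask, mask_token_id)
    return corrupted, mask

# Example: mask half of a batch of 64-token LQAE sequences.
token_ids = torch.randint(0, 50265, (4, 64))
corrupted, mask = random_mask(token_ids, mask_token_id=50264, mask_ratio=0.5)
```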
Critical Analysis and Future Implications
While LQAE brings significant advancements, several points merit further exploration:
- Interpretability: The text representations, though effective, are not human-readable. Future research might focus on making these representations human-interpretable without sacrificing alignment quality.
- Scalability: Scaling the model with larger image encoders and BERT models could yield even better results. However, this requires substantial computational resources.
- Generalization: Extending this technique to other modalities (e.g., audio, video) and exploring its generalizability could open up new applications in multimodal learning.
Conclusion
The Language Quantized AutoEncoder (LQAE) provides a robust framework for aligning text and image modalities without the need for curated datasets. Its ability to facilitate few-shot learning using LLMs marks a significant step forward in unsupervised multimodal learning. This method not only improves the practical utility of LLMs in visual tasks but also paves the way for future research into more sophisticated and scalable unsupervised learning techniques. The implications for AI are broad, promising advancements in fields that demand integration of diverse data modalities.