Autoregressive Pre-Training on Pixels and Texts (2404.10710v3)
Abstract: The integration of visual and textual information represents a promising direction in the advancement of language models. In this paper, we explore the dual modality of language, visual and textual, within an autoregressive framework pre-trained on both document images and texts. Our method employs a multimodal training strategy, using visual data through next patch prediction with a regression head and/or textual data through next token prediction with a classification head. We focus on understanding the interaction between these two modalities and their combined impact on model performance. Our extensive evaluation across a wide range of benchmarks shows that incorporating both visual and textual data significantly improves the performance of pixel-based language models. Remarkably, we find that a unidirectional pixel-based model trained solely on visual data can achieve results comparable to state-of-the-art bidirectional models on several language understanding tasks. This work uncovers the untapped potential of integrating visual and textual modalities for more effective language modeling. We release our code, data, and model checkpoints at https://github.com/ernie-research/pixelgpt.
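As a rough illustration of the training objective the abstract describes, the sketch below pairs a shared causal Transformer backbone with a regression head for next patch prediction and a classification head for next token prediction. The class name, layer sizes, patch dimension, vocabulary size, and other details are illustrative assumptions, not the paper's released implementation; the repository linked above is the authoritative reference.

```python
# Minimal PyTorch sketch of the dual-objective setup described in the abstract:
# a shared causal Transformer backbone trained with next patch prediction
# (regression over pixel values) and/or next token prediction (classification
# over a vocabulary). All names and sizes here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualModalityDecoder(nn.Module):
    def __init__(self, d_model=768, n_layers=12, n_heads=12,
                 patch_dim=16 * 16 * 3, vocab_size=50257):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # Positional embeddings omitted for brevity.
        self.patch_embed = nn.Linear(patch_dim, d_model)      # flattened patches in
        self.token_embed = nn.Embedding(vocab_size, d_model)  # token ids in
        self.patch_head = nn.Linear(d_model, patch_dim)       # regression head
        self.token_head = nn.Linear(d_model, vocab_size)      # classification head

    @staticmethod
    def _causal_mask(seq_len, device):
        # -inf above the diagonal enforces left-to-right (autoregressive) attention.
        return torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=device), diagonal=1)

    def forward(self, patches=None, tokens=None):
        """patches: (B, T, patch_dim) pixel values; tokens: (B, T) token ids."""
        losses = {}
        if patches is not None:
            h = self.backbone(self.patch_embed(patches),
                              mask=self._causal_mask(patches.size(1), patches.device))
            # Regress the pixels of patch t+1 from the hidden state at position t.
            losses["pixel"] = F.mse_loss(self.patch_head(h[:, :-1]), patches[:, 1:])
        if tokens is not None:
            h = self.backbone(self.token_embed(tokens),
                              mask=self._causal_mask(tokens.size(1), tokens.device))
            # Standard next-token cross-entropy.
            logits = self.token_head(h[:, :-1])
            losses["text"] = F.cross_entropy(
                logits.flatten(0, 1), tokens[:, 1:].flatten())
        return losses
```

Under this sketch, joint pixel-and-text pre-training would sum the two losses (optionally weighted) and backpropagate through the shared backbone, while pixel-only or text-only pre-training uses a single loss; how the paper actually combines the objectives is specified in the released code.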
Authors:
- Yekun Chai
- Qingyi Liu
- Jingwu Xiao
- Shuohuan Wang
- Yu Sun
- Hua Wu