Autoregressive Pre-Training on Pixels and Texts (2404.10710v3)

Published 16 Apr 2024 in cs.CL and cs.CV

Abstract: The integration of visual and textual information represents a promising direction in the advancement of LLMs. In this paper, we explore the dual modality of language--both visual and textual--within an autoregressive framework, pre-trained on both document images and texts. Our method employs a multimodal training strategy, utilizing visual data through next patch prediction with a regression head and/or textual data through next token prediction with a classification head. We focus on understanding the interaction between these two modalities and their combined impact on model performance. Our extensive evaluation across a wide range of benchmarks shows that incorporating both visual and textual data significantly improves the performance of pixel-based LLMs. Remarkably, we find that a unidirectional pixel-based model trained solely on visual data can achieve comparable results to state-of-the-art bidirectional models on several language understanding tasks. This work uncovers the untapped potential of integrating visual and textual modalities for more effective language modeling. We release our code, data, and model checkpoints at \url{https://github.com/ernie-research/pixelgpt}.

Comprehensive Study on Autoregressive Pixel-Based LLMs with Visual Text Pre-Training

Introduction

The research introduces a framework for pixel-based autoregressive LLMs pre-trained on a corpus of over 400 million documents rendered as RGB images. The approach employs a dual-modality training regimen that integrates visual and textual data: visual inputs are trained with a next-patch prediction objective, and textual inputs with next-token prediction. On this basis, the paper investigates the synergy between the visual and textual modalities in LLMs and their combined impact on performance.

Pre-training Methodology

The methodology section details the processes involved in rendering textual data into RGB images and the subsequent pre-training objectives.

Rendering Text as Images

Text is rendered into RGB images, with each image representing a text sequence as a grid of fixed-size patches that serve as the model's input units. The paper specifies a resolution for the rendered images and uses visual cues within them to mark sequence ends and lengths. This visual representation lets the model process text in its rendered form, bypassing the fixed-vocabulary constraints of traditional text tokenization.
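As a concrete illustration, the sketch below renders a string to an RGB strip and slices it into square patches. The patch size, strip length, and font here are assumptions chosen for illustration (following the 16x16-patch convention of prior pixel language models such as PIXEL); the paper and the released pixelgpt code define the actual rendering parameters.

```python
from PIL import Image, ImageDraw
import numpy as np

# Illustrative rendering parameters -- the paper specifies its own resolution
# and patch size; these values are assumptions, not the official configuration.
PATCH_SIZE = 16
MAX_PATCHES = 256  # assumed maximum sequence length in patches

def render_text_to_patches(text: str) -> np.ndarray:
    """Render a text string as a black-on-white RGB strip and split it into square patches."""
    width = PATCH_SIZE * MAX_PATCHES
    image = Image.new("RGB", (width, PATCH_SIZE), color="white")
    draw = ImageDraw.Draw(image)
    # Default bitmap font; the paper's renderer and font choice are implementation details.
    draw.text((2, 2), text, fill="black")
    pixels = np.asarray(image, dtype=np.float32) / 255.0   # (H, W, 3) in [0, 1]
    # Split the strip into a sequence of (PATCH_SIZE, PATCH_SIZE, 3) patches.
    patches = np.split(pixels, MAX_PATCHES, axis=1)
    return np.stack(patches)                               # (MAX_PATCHES, PATCH_SIZE, PATCH_SIZE, 3)

patches = render_text_to_patches("Pixels can stand in for tokens.")
print(patches.shape)  # (256, 16, 16, 3)
```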

Dual-Modality Training

The model uses separate prediction heads for the visual and textual inputs: a regression head for next-patch prediction on rendered images and a classification head for next-token prediction on text, each trained with a loss function appropriate to its modality. Training draws on pixel-only and text-only data streams, as well as paired image-text data in several configurations, to investigate how different training setups affect model performance.
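A minimal sketch of this dual-head setup is given below, assuming a causally masked Transformer trunk shared by both modalities, a mean-squared-error loss for the regression head, and a cross-entropy loss for the classification head. Module names, dimensions, and the vocabulary size are illustrative and do not reproduce the released PixelGPT configuration.

```python
import torch
import torch.nn as nn

# Assumed dimensions for illustration only.
D_MODEL, PATCH_DIM, VOCAB_SIZE = 768, 16 * 16 * 3, 50_000

class DualHeadDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=12, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=12)   # shared trunk, causally masked
        self.patch_embed = nn.Linear(PATCH_DIM, D_MODEL)              # visual input projection
        self.token_embed = nn.Embedding(VOCAB_SIZE, D_MODEL)          # textual input embedding
        self.regression_head = nn.Linear(D_MODEL, PATCH_DIM)          # predicts next-patch pixels
        self.classification_head = nn.Linear(D_MODEL, VOCAB_SIZE)     # predicts next-token logits

    def _causal_mask(self, seq_len: int) -> torch.Tensor:
        return nn.Transformer.generate_square_subsequent_mask(seq_len)

    def pixel_loss(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, seq, PATCH_DIM); predict patch t+1 from patches up to t.
        h = self.backbone(self.patch_embed(patches[:, :-1]),
                          mask=self._causal_mask(patches.size(1) - 1))
        pred = self.regression_head(h)
        return nn.functional.mse_loss(pred, patches[:, 1:])           # regression objective

    def text_loss(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq) of token ids; predict token t+1 from tokens up to t.
        h = self.backbone(self.token_embed(tokens[:, :-1]),
                          mask=self._causal_mask(tokens.size(1) - 1))
        logits = self.classification_head(h)
        return nn.functional.cross_entropy(
            logits.reshape(-1, VOCAB_SIZE), tokens[:, 1:].reshape(-1))  # classification objective

model = DualHeadDecoder()
loss = model.pixel_loss(torch.rand(2, 32, PATCH_DIM)) + \
       model.text_loss(torch.randint(0, VOCAB_SIZE, (2, 32)))
```

Depending on the configuration, training can use the pixel loss alone, the text loss alone, or both together on paired data, which is what the paper's ablations compare.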

Experimental Setup and Results

The model is evaluated on several language understanding benchmarks, including GLUE and XNLI, and compared against prominent baselines such as BERT and PIXEL.
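For context, a pixel-based model is fine-tuned on such benchmarks by rendering each example's text and feeding the resulting patch sequence to the model. The snippet below sketches that data preparation for one GLUE task, reusing the hypothetical render_text_to_patches helper from above; the paper's actual fine-tuning pipeline may differ.

```python
from datasets import load_dataset  # Hugging Face Datasets

# Load a small slice of SST-2 (a GLUE task) and render each sentence to patches.
# The split size and rendering helper are illustrative assumptions.
sst2 = load_dataset("glue", "sst2", split="train[:1000]")

examples = [
    {"patches": render_text_to_patches(row["sentence"]), "label": row["label"]}
    for row in sst2
]
print(len(examples), examples[0]["patches"].shape)  # 1000 (256, 16, 16, 3)
```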

Performance Evaluation

The pixel-based model, especially in configurations trained on dual-modality data, demonstrated competitive or superior performance relative to traditional text-based and other pixel-based models. On the GLUE benchmark, for instance, the proposed model outperformed several baselines and showed significant gains on tasks requiring a deeper understanding of context and linguistic nuance.

Cross-Lingual Evaluation

On the XNLI dataset, which assesses cross-lingual understanding, the model achieved robust performance across multiple languages. This illustrates its ability to generalize across languages without language-specific tokenization, showcasing its potential in multilingual settings.

Analysis

The analysis also covers training data size and batch configuration, showing that large batch sizes during fine-tuning help stabilize learning and improve performance. Moreover, training on RGB renderings proved more effective than grayscale, indicating the importance of color-channel information when processing rendered text.

Conclusion and Future Directions

The paper confirms the viability of training LLMs on RGB renderings of textual content and highlights the benefits of integrating visual and textual data. The findings point to future work on scaling up model size and exploring more extensive pre-training regimes, with larger models and datasets offering a promising path toward stronger multimodal language processing.

This research represents a significant step toward enhancing the capabilities of LLMs by leveraging the rich information content offered by visual representations of text. Further exploration in this direction can lead to more sophisticated and contextually aware language processing tools.

References (37)
  1. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245.
  2. Gemini: A family of highly capable multimodal models. CoRR, abs/2312.11805.
  3. Highway transformer: Self-gating enhanced self-attentive networks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6887–6900, Online. Association for Computational Linguistics.
  4. ERNIE-code: Beyond English-centric cross-lingual pretraining for programming languages. In Findings of the Association for Computational Linguistics: ACL 2023, pages 10628–10650, Toronto, Canada. Association for Computational Linguistics.
  5. Generative pretraining from pixels. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 1691–1703. PMLR.
  6. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 8440–8451. Association for Computational Linguistics.
  7. XNLI: evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 2475–2485. Association for Computational Linguistics.
  8. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  9. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
  10. Scalable pre-training of large autoregressive image models. arXiv preprint arXiv:2401.08541.
  11. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883.
  12. Masked autoencoders are scalable vision learners. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 15979–15988. IEEE.
  13. Ocr-free document understanding transformer. In European Conference on Computer Vision, pages 498–517. Springer.
  14. Hatemoji: A test suite and adversarially-generated dataset for benchmarking and detecting emoji-based hate. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1352–1368, Seattle, United States. Association for Computational Linguistics.
  15. The stack: 3 tb of permissively licensed source code. arXiv preprint arXiv:2211.15533.
  16. Text rendering strategies for pixel language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 10155–10172. Association for Computational Linguistics.
  17. Starcoder 2 and the stack v2: The next generation. CoRR, abs/2402.19173.
  18. OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.
  19. Language models are unsupervised multitask learners.
  20. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67.
  21. Zero-shot text-to-image generation. In International conference on machine learning, pages 8821–8831. PMLR.
  22. Language modelling with pixels. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
  23. Multilingual pixel representations for translation and effective cross-lingual transfer. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 13845–13861. Association for Computational Linguistics.
  24. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
  25. Noam Shazeer. 2020. Glu variants improve transformer. arXiv preprint arXiv:2002.05202.
  26. Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research. arXiv preprint.
  27. Luca Soldaini and Kyle Lo. 2023. peS2o (Pretraining Efficiently on S2ORC) Dataset. Technical report, Allen Institute for AI. ODC-By, https://github.com/allenai/pes2o.
  28. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063.
  29. Pixar: Auto-regressive language modeling in pixel space. arXiv preprint arXiv:2401.03321.
  30. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288.
  31. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  32. CLIPPO: image-and-language understanding from pixels only. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 11006–11017. IEEE.
  33. Neural discrete representation learning. Advances in neural information processing systems, 30.
  34. Attention is all you need. Advances in neural information processing systems, 30.
  35. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the Workshop: Analyzing and Interpreting Neural Networks for NLP, BlackboxNLP@EMNLP 2018, Brussels, Belgium, November 1, 2018, pages 353–355. Association for Computational Linguistics.
  36. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.
  37. Biao Zhang and Rico Sennrich. 2019. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32.
Authors (6)
  1. Yekun Chai (18 papers)
  2. Qingyi Liu (3 papers)
  3. Jingwu Xiao (1 paper)
  4. Shuohuan Wang (30 papers)
  5. Yu Sun (226 papers)
  6. Hua Wu (191 papers)
Citations (1)