PIXAR: Auto-Regressive Language Modeling in Pixel Space (2401.03321v2)
Abstract: Recent work has shown that it is possible to build open-vocabulary LLMs that operate directly on pixel representations. These models are implemented as autoencoders that reconstruct masked patches of rendered text. However, such pixel-based LLMs are limited to discriminative tasks (e.g., classification) and, like BERT, cannot generate text. They are therefore unsuitable for generative tasks such as free-form question answering. In this work, we introduce PIXAR, the first pixel-based autoregressive LLM that performs text generation. Consisting only of a decoder, PIXAR can perform free-form generative tasks while keeping its parameter count on par with previous encoder-decoder models. Furthermore, we highlight the challenges of generating text as non-noisy images and show that they stem from using a maximum likelihood objective. To overcome this problem, we propose an adversarial pretraining stage that improves the readability and accuracy of PIXAR by 8.1 points on LAMBADA and 8.5 points on bAbI, making it comparable to GPT-2 on text generation tasks. This paves the way for open-vocabulary LLMs that operate on perceptual input only and calls into question the necessity of the usual symbolic input representation, i.e., text as (sub)tokens.
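The abstract describes two mechanisms: a decoder-only transformer that autoregressively predicts the next patch of rendered text, and a second pretraining stage that adds an adversarial loss because a pure maximum-likelihood objective yields blurry, unreadable patches. The PyTorch sketch below illustrates both under stated assumptions; it is not the authors' implementation, and names such as `PatchDecoder`, `PatchDiscriminator`, `patch_dim`, and the MSE stand-in for the likelihood objective are illustrative.

```python
# Minimal sketch of autoregressive patch prediction plus an adversarial
# pretraining term. All architecture and loss choices here are assumptions
# for illustration, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchDecoder(nn.Module):
    """Decoder-only transformer that autoregressively predicts the next
    patch of rendered text from the patches seen so far."""

    def __init__(self, patch_dim=256, d_model=512, n_heads=8, n_layers=4, max_len=256):
        super().__init__()
        self.inp = nn.Linear(patch_dim, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, patch_dim)  # regress the next patch's pixels

    def forward(self, patches):  # patches: (B, T, patch_dim)
        T = patches.size(1)
        h = self.inp(patches) + self.pos(torch.arange(T, device=patches.device))
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(patches.device)
        h = self.blocks(h, mask=causal)  # causal self-attention = decoder-only LM
        return self.out(h)               # position t predicts patch t+1


class PatchDiscriminator(nn.Module):
    """Judges whether a single patch looks like crisply rendered text."""

    def __init__(self, patch_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(patch_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, p):
        return self.net(p)


def pretrain_loss(model, disc, patches, stage2=False, adv_weight=0.1):
    """Stage 1: pure maximum-likelihood reconstruction of the next patch.
    Stage 2: the same loss plus a non-saturating adversarial term that
    pushes generated patches toward readable, non-noisy text."""
    pred = model(patches[:, :-1])            # predictions for patches 1..T-1
    mle = F.mse_loss(pred, patches[:, 1:])   # stand-in for the MLE objective
    if not stage2:
        return mle
    # Generator side of the GAN objective: try to fool the discriminator.
    # (The discriminator's own real-vs-fake update is omitted for brevity.)
    logits_fake = disc(pred.reshape(-1, pred.size(-1)))
    adv = F.binary_cross_entropy_with_logits(logits_fake, torch.ones_like(logits_fake))
    return mle + adv_weight * adv


model, disc = PatchDecoder(), PatchDiscriminator()
patches = torch.randn(2, 16, 256)            # toy batch: 2 sequences of 16 patches
loss = pretrain_loss(model, disc, patches, stage2=True)
loss.backward()
```

The adversarial term above is the standard non-saturating generator loss from the GAN literature (Goodfellow et al., cited below); the paper's actual discriminator architecture, patch size, and loss weighting may differ.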
- CM3: A causal masked multimodal model of the internet, 2022.
- Scaling laws for generative mixed-modal language models, 2023.
- BEiT: BERT pre-training of image transformers, 2022.
- Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc., 2009.
- Language models are few-shot learners, 2020.
- MaskGIT: Masked generative image transformer, 2022.
- Muse: Text-to-image generation via masked generative transformers, 2023.
- Generative pretraining from pixels. In International conference on machine learning, pp. 1691–1703. PMLR, 2020.
- Revisiting pre-trained models for Chinese natural language processing. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.findings-emnlp.58. URL http://dx.doi.org/10.18653/v1/2020.findings-emnlp.58.
- Glyph-aware embedding of Chinese characters. In Faruqui, M., Schuetze, H., Trancoso, I., and Yaghoobzadeh, Y. (eds.), Proceedings of the First Workshop on Subword and Character Level Models in NLP, pp. 64–69, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-4109. URL https://aclanthology.org/W17-4109.
- BERT: Pre-training of deep bidirectional transformers for language understanding, 2019.
- Text processing like humans do: Visually attacking and shielding NLP systems, 2020.
- Taming transformers for high-resolution image synthesis, 2021.
- Generative adversarial networks, 2014.
- Masked autoencoders are scalable vision learners, 2021.
- Perceiver: General perception with iterative attention, 2021.
- Exploring the limits of language modeling, 2016.
- Challenges and applications of large language models, 2023.
- SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, 2018.
- GlyphDiffusion: Text generation as image generation, 2023a.
- MAGE: Masked generative encoder to unify representation learning and image synthesis, 2023b.
- Learning character-level compositionality with visual features. In Barzilay, R. and Kan, M.-Y. (eds.), Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2059–2068, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1188. URL https://aclanthology.org/P17-1188.
- Visually grounded reasoning across languages and cultures. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 10467–10485, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. URL https://aclanthology.org/2021.emnlp-main.818.
- SGDR: Stochastic gradient descent with warm restarts, 2017.
- Decoupled weight decay regularization, 2019.
- Glyce: Glyph-vectors for Chinese character representations, 2020.
- Generating high fidelity images with subscale pixel networks and multidimensional upscaling, 2018.
- A Course in Game Theory. The MIT Press, 1994. ISBN 0262150417.
- The LAMBADA dataset: Word prediction requiring a broad discourse context, 2016.
- The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only, 2023.
- Language models are unsupervised multitask learners, 2019.
- Hierarchical text-conditional image generation with clip latents, 2022.
- High-resolution image synthesis with latent diffusion models, 2022.
- Language modelling with pixels, 2023.
- Robust open-vocabulary translation from visual text representations, 2021.
- Multilingual pixel representations for translation and effective cross-lingual transfer, 2023.
- PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications, 2017.
- Neural machine translation of rare words with subword units, 2016.
- A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.
- GLU variants improve transformer, 2020.
- RoFormer: Enhanced transformer with rotary position embedding, 2023.
- Super characters: A conversion from sentiment classification to image classification. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 309–315, Brussels, Belgium, October 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-6245. URL https://aclanthology.org/W18-6245.
- ChineseBERT: Chinese pretraining enhanced by glyph and Pinyin information. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 2065–2075, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.161. URL https://aclanthology.org/2021.acl-long.161.
- LLaMA: Open and efficient foundation language models, 2023a.
- Llama 2: Open foundation and fine-tuned chat models, 2023b.
- GIVT: Generative infinite-vocabulary transformers, 2023.
- Pixel recurrent neural networks, 2016.
- Neural discrete representation learning, 2018.
- Attention is all you need, 2023.
- GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Linzen, T., Chrupała, G., and Alishahi, A. (eds.), Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353–355, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-5446. URL https://aclanthology.org/W18-5446.
- Towards AI-complete question answering: A set of prerequisite toy tasks, 2015.
- BLOOM: A 176B-parameter open-access multilingual language model, 2023.
- Google’s neural machine translation system: Bridging the gap between human and machine translation, 2016.
- mT5: A massively multilingual pre-trained text-to-text transformer, 2021.