PIXAR: Auto-Regressive Language Modeling in Pixel Space (2401.03321v2)

Published 6 Jan 2024 in cs.CL

Abstract: Recent work showed the possibility of building open-vocabulary LLMs that directly operate on pixel representations. These models are implemented as autoencoders that reconstruct masked patches of rendered text. However, these pixel-based LLMs are limited to discriminative tasks (e.g., classification) and, similar to BERT, cannot be used to generate text. Therefore, they cannot be used for generative tasks such as free-form question answering. In this work, we introduce PIXAR, the first pixel-based autoregressive LLM that performs text generation. Consisting of only a decoder, PIXAR can perform free-form generative tasks while keeping the number of parameters on par with previous encoder-decoder models. Furthermore, we highlight the challenges of generating text as non-noisy images and show this is due to using a maximum likelihood objective. To overcome this problem, we propose an adversarial pretraining stage that improves the readability and accuracy of PIXAR by 8.1 on LAMBADA and 8.5 on bAbI -- making it comparable to GPT-2 on text generation tasks. This paves the way to build open-vocabulary LLMs that operate on perceptual input only and calls into question the necessity of the usual symbolic input representation, i.e., text as (sub)tokens.


Summary

  • The paper presents PIXAR, the first autoregressive language model capable of generating text directly from pixel representations.
  • It employs a GPT-like decoder-only Transformer enhanced with RMSNorm, SwiGLU, and rotary positional embeddings to process sequences of image patches.
  • A two-stage training procedure, maximum-likelihood pretraining followed by adversarial pretraining, improves readability and yields performance comparable to GPT-2.

Overview of "PIXAR: Auto-Regressive Language Modeling in Pixel Space"

The paper "PIXAR: Auto-Regressive Language Modeling in Pixel Space" introduces a novel direction in NLP by presenting PIXAR, the first autoregressive language model that operates exclusively on pixel representations of text. This approach departs significantly from traditional token-based language models and opens new avenues for research and application in both symbolic and perceptual domains.
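
The core framing can be illustrated with a toy sketch (illustrative only, not the paper's implementation): rendered text is sliced into a left-to-right sequence of fixed-width pixel patches, and the model predicts the next patch from the preceding ones, analogous to next-token prediction.

```python
import numpy as np

def to_patch_sequence(image: np.ndarray, patch_w: int) -> list:
    """Split a rendered-text image (H x W) into a left-to-right sequence
    of H x patch_w patches. An autoregressive pixel model then predicts
    patch t+1 from patches 1..t, mirroring next-token prediction."""
    h, w = image.shape
    assert w % patch_w == 0, "width must be a multiple of the patch width"
    return [image[:, i:i + patch_w] for i in range(0, w, patch_w)]
```

Concatenating the patches along the width axis reconstructs the original image, so the patch sequence is a lossless re-serialization of the rendered text.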

Key Contributions

  1. Pixel-Based Text Generation: PIXAR is the first model capable of generating textual content directly from pixel representations. Previous pixel-based models, such as PIXEL, were limited to discriminative tasks; PIXAR enables generative tasks, filling a gap in the capabilities of pixel-based models.
  2. Model Architecture: PIXAR is a decoder-only model akin to GPT-like architectures. It uses a stack of Transformer layers enhanced with RMSNorm, SwiGLU activations, and rotary positional embeddings to process sequences of image patches representing rendered text.
  3. Innovative Training Strategy: The authors introduce a two-stage pretraining process. PIXAR is first trained with maximum likelihood estimation (MLE) to predict the next sequence of pixel patches. Because MLE alone produces noisy, hard-to-read generations, a second adversarial pretraining stage is added, significantly improving readability and accuracy.
  4. Comparative Performance: With a purely pixel-based approach, PIXAR achieves performance on generative tasks comparable to GPT-2 while keeping a similar parameter count. Its robustness to orthographic attacks is a further advantage over token-based models.
  5. Attention and Symbolic Learning: An analysis of attention patterns shows how PIXAR interprets perceptual input to make predictions, suggesting that the model implicitly learns symbolic information from purely visual cues, a capability that challenges traditional token-based assumptions.
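
The architectural components named in item 2 can be sketched in isolation. This is a minimal NumPy illustration of RMSNorm, SwiGLU, and rotary embeddings; the shapes, weight names, and the rotate-half rotary variant are illustrative assumptions, not the paper's code.

```python
import numpy as np

def rms_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # RMSNorm: rescale by root-mean-square only (no mean subtraction,
    # unlike LayerNorm), which is cheaper and often just as effective.
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def swiglu(x: np.ndarray, W: np.ndarray, V: np.ndarray, W2: np.ndarray) -> np.ndarray:
    # SwiGLU feed-forward: SiLU(xW) gates a parallel projection xV,
    # then a final projection W2 maps back to the model dimension.
    a = x @ W
    silu = a / (1.0 + np.exp(-a))   # SiLU(a) = a * sigmoid(a)
    return (silu * (x @ V)) @ W2

def rotary(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    # Rotary positional embedding: rotate feature pairs by a
    # position-dependent angle; position 0 is left unrotated.
    seq, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)
    angles = np.outer(np.arange(seq), freqs)   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

In the real model these pieces sit inside each Transformer layer: rotary embeddings are applied to the attention queries and keys, RMSNorm precedes each sub-layer, and SwiGLU forms the feed-forward block.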

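The two-stage objective described in item 3 can be caricatured as follows. This is a hedged sketch: the Bernoulli pixel likelihood, the non-saturating generator term, and the weight `lam` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def _sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def mle_loss(logits: np.ndarray, target: np.ndarray) -> float:
    # Stage 1: per-pixel Bernoulli negative log-likelihood for
    # binary text patches (the maximum likelihood objective).
    p = _sigmoid(logits)
    return float(-np.mean(target * np.log(p + 1e-9)
                          + (1.0 - target) * np.log(1.0 - p + 1e-9)))

def adversarial_g_loss(d_fake_logits: np.ndarray) -> float:
    # Stage 2 generator term: push discriminator scores on generated
    # patches toward "real" (non-saturating GAN loss).
    return float(-np.mean(np.log(_sigmoid(d_fake_logits) + 1e-9)))

def stage2_loss(logits, target, d_fake_logits, lam: float = 0.1) -> float:
    # Second pretraining stage: likelihood term plus a weighted
    # adversarial term that sharpens otherwise noisy generations.
    return mle_loss(logits, target) + lam * adversarial_g_loss(d_fake_logits)
```

The intuition is that MLE alone averages over plausible glyph renderings and produces blurry, noisy patches; the adversarial term penalizes outputs a discriminator can tell apart from cleanly rendered text.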
Implications and Future Directions

The research implications of PIXAR are substantial, suggesting a shift in how text can be processed and generated by machine learning models. By operating purely in pixel space, PIXAR questions the conventional necessity of symbolic tokenization, a foundational step in NLP pipelines. This pixel-based approach can lead to more inclusive and robust language models capable of handling diverse input forms beyond standard textual data.

Practically, PIXAR paves the way for cross-modal applications where both text and other visual data types can be unified under a single modeling framework. Such models hold promise for multimodal AI systems capable of more human-like processing of diverse inputs.

Looking ahead, the authors acknowledge limitations of the current approach, such as scalability and performance bottlenecks. Future work may extend PIXAR to other languages and writing systems, incorporate larger datasets to test whether the pixel-based paradigm scales effectively with data, and explore hybrid models that blend pixel- and token-based approaches for optimized performance across tasks.

In conclusion, PIXAR represents a compelling step forward in the exploration of pixel-based representations in NLP, offering a fresh perspective on the long-standing challenge of language modeling and introducing new challenges and opportunities for the advancement of AI.