
Language Modelling with Pixels (2207.06991v2)

Published 14 Jul 2022 in cs.CL, cs.AI, cs.CV, and cs.LG

Abstract: Language models are defined over a finite set of inputs, which creates a vocabulary bottleneck when we attempt to scale the number of supported languages. Tackling this bottleneck results in a trade-off between what can be represented in the embedding matrix and computational issues in the output layer. This paper introduces PIXEL, the Pixel-based Encoder of Language, which suffers from neither of these issues. PIXEL is a pretrained language model that renders text as images, making it possible to transfer representations across languages based on orthographic similarity or the co-activation of pixels. PIXEL is trained to reconstruct the pixels of masked patches instead of predicting a distribution over tokens. We pretrain the 86M parameter PIXEL model on the same English data as BERT and evaluate on syntactic and semantic tasks in typologically diverse languages, including various non-Latin scripts. We find that PIXEL substantially outperforms BERT on syntactic and semantic processing tasks on scripts that are not found in the pretraining data, but PIXEL is slightly weaker than BERT when working with Latin scripts. Furthermore, we find that PIXEL is more robust than BERT to orthographic attacks and linguistic code-switching, further confirming the benefits of modelling language with pixels.

Authors (6)
  1. Phillip Rust (12 papers)
  2. Jonas F. Lotz (6 papers)
  3. Emanuele Bugliarello (27 papers)
  4. Elizabeth Salesky (27 papers)
  5. Miryam de Lhoneux (29 papers)
  6. Desmond Elliott (53 papers)
Citations (37)

Summary

An Analysis of Language Modelling with Pixels

The paper "LLMling with Pixels" introduces the Pixel-based Encoder of Language (Pixel), an innovative approach to NLP that presents text as pixelated images rather than relying on traditional token-based methods. This method addresses the prevalent issue of a vocabulary bottleneck in pretrained LLMs (PLMs), offering a potential solution for incorporating a vast array of languages without the constraints of finite vocabularies associated with subword tokens.

Model Architecture and Training

Pixel employs a three-part architecture: a text renderer, an encoder, and a decoder. The renderer converts text into a fixed-size image that is divided into small square patches, which allows the model to process a diverse range of scripts and languages, including those with complex layouts. The encoder and decoder follow the Vision Transformer masked autoencoder (ViT-MAE) architecture, with span masking applied over image patches for efficient representation learning.
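The rendering step can be pictured with a short, hedged sketch. The snippet below is not the authors' renderer (which uses a proper text-shaping backend and fonts); it simply draws a string onto a 16-pixel-high grayscale strip and slices it into 16×16 patches. The patch size and the 529-patch maximum length are assumptions matching a ViT-style setup.

```python
# Minimal sketch of the render-then-patchify idea (assumed parameters, not the
# paper's exact renderer): draw text onto a 16px-high grayscale strip, then cut
# it into 16x16 patches that play the role subword tokens play for BERT.
import numpy as np
from PIL import Image, ImageDraw, ImageFont

PATCH = 16          # assumed patch size: 16x16 pixels per patch, as in ViT
MAX_PATCHES = 529   # assumed maximum sequence length in patches

def render_to_patches(text: str) -> np.ndarray:
    """Render `text` onto a white strip and return an array of shape (MAX_PATCHES, 16, 16)."""
    width = PATCH * MAX_PATCHES
    img = Image.new("L", (width, PATCH), color=255)            # white background
    ImageDraw.Draw(img).text((0, 0), text, fill=0,
                             font=ImageFont.load_default())    # black glyphs
    arr = np.asarray(img, dtype=np.float32) / 255.0             # shape (16, width)
    return arr.reshape(PATCH, MAX_PATCHES, PATCH).transpose(1, 0, 2)

patches = render_to_patches("Language modelling with pixels")
print(patches.shape)   # (529, 16, 16); trailing all-white patches act as padding
```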

Pixel has 86M parameters and is pretrained on the same English data as BERT. It is trained with a loss that reconstructs the pixel values of masked image patches rather than predicting a distribution over a vocabulary. By focusing on the orthographic (visual) form of language, Pixel bypasses vocabulary constraints entirely.
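As a rough illustration of this objective, the following sketch computes a ViT-MAE-style reconstruction loss: mean squared error over pixel values, averaged only across the patches that were masked out. The tensor names and the optional per-patch target normalization are assumptions, not the paper's exact implementation.

```python
# Hedged sketch of a masked-patch reconstruction loss in the style of ViT-MAE.
import torch

def masked_patch_loss(pred: torch.Tensor,
                      target: torch.Tensor,
                      mask: torch.Tensor,
                      normalize_targets: bool = True) -> torch.Tensor:
    """
    pred, target: (batch, num_patches, patch_dim) pixel values, e.g. patch_dim = 16*16.
    mask:         (batch, num_patches), 1.0 where the patch was masked out.
    """
    if normalize_targets:
        # Normalize each target patch to zero mean / unit variance (as in MAE).
        mean = target.mean(dim=-1, keepdim=True)
        var = target.var(dim=-1, keepdim=True)
        target = (target - mean) / (var + 1e-6).sqrt()
    loss = ((pred - target) ** 2).mean(dim=-1)   # per-patch squared error
    return (loss * mask).sum() / mask.sum()       # average over masked patches only

# Example shapes: 2 sequences, 529 patches, 256 pixels per patch, ~25% masked.
pred = torch.randn(2, 529, 256)
target = torch.rand(2, 529, 256)
mask = (torch.rand(2, 529) < 0.25).float()
print(masked_patch_loss(pred, target, mask))
```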

Empirical Evaluation

Pixel's effectiveness was tested on syntactic and semantic tasks across 32 languages spanning 14 different scripts, including scripts absent from its pretraining corpus. The evaluation revealed that Pixel significantly outperforms BERT on tasks in non-Latin scripts, demonstrating marked improvements under conditions where traditional token-based models often falter. However, Pixel was slightly less effective than BERT on tasks in Latin scripts, indicating potential areas for further refinement.

Robustness and Adaptability

One notable feature of Pixel is its robustness to orthographic attacks and its handling of linguistic code-switching, maintaining performance where token-based models like BERT show vulnerabilities. Because it does not depend on a predefined vocabulary, Pixel can cope with text variations and character substitutions that would otherwise fragment into rare or unknown tokens, highlighting the flexibility of the pixel-based approach.
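To make the idea of an orthographic attack concrete, here is a hypothetical perturbation function (the confusable mapping and substitution rate are illustrative, not the paper's attack setup). It swaps Latin characters for visually similar Cyrillic ones; a subword tokenizer typically shatters the result into rare pieces, while the rendered image changes very little at the pixel level.

```python
# Illustrative orthographic perturbation: replace some Latin letters with
# Cyrillic look-alikes. Mapping and rate are assumptions for demonstration.
import random

CONFUSABLES = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "c": "\u0441", "p": "\u0440"}

def perturb(text: str, ratio: float = 0.3, seed: int = 0) -> str:
    """Replace a fraction of confusable characters with visual look-alikes."""
    rng = random.Random(seed)
    return "".join(
        CONFUSABLES[ch] if ch in CONFUSABLES and rng.random() < ratio else ch
        for ch in text
    )

print(perturb("a pixel encoder can cope with character noise"))
```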

Implications and Future Directions

The research addresses the limitations of token-based NLP models in handling diverse and complex scripts, offering a pathway towards more inclusive and extensive language representation capabilities. By treating text as an image, Pixel sidesteps traditional tokenization, which poses significant challenges in multilingual contexts.

Looking forward, there are numerous avenues for improving Pixel's performance. A primary area of exploration is the enhancement of semantic task performance, potentially through extended pretraining or the incorporation of alternative objectives to better capture long-range dependencies. Additionally, training on a more typologically diverse corpus could further enhance Pixel's adaptability across languages.

The introduction of Pixel sets the stage for further exploring the intersection of computer vision and NLP, suggesting possibilities for enhancing multilingual models without the constraints of existing tokenization schemes. By refining this approach, there is potential to significantly improve the capabilities of AI in processing global languages.

In conclusion, the pixel-based language modelling approach presents a compelling and scalable alternative in NLP, offering exciting possibilities for expanding the horizons of language models. While challenges remain, the foundational work in this paper provides crucial insights and opens new avenues for inclusive language model development.
