Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
The paper "Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding" introduces a novel approach to addressing the challenges posed by visually-situated language. Unlike domain-specific methods, this approach leverages the versatility of pretraining on web page screenshots to develop a generalized model capable of understanding a wide range of visual language contexts. The model, termed Pix2Struct, is pretrained to convert masked screenshots into simplified HTML, which is detailed in subsequent sections of the paper.
Key Contributions and Methodology
Pix2Struct addresses the fragmented nature of previous work in visually-situated language understanding by proposing a unified pretraining strategy. The model processes screenshots to predict HTML parses, effectively integrating signals akin to OCR, language modeling, and image captioning. The objective thus aligns with common pretraining signals while retaining the advantages of a unified architecture.
Pretraining Strategy: The model is pretrained on a screenshot parsing task that converts visual inputs to text by predicting a simplified HTML structure of the underlying webpage. The inputs are partially masked, encouraging the model to infer the missing content in a manner similar to masked language modeling.
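To make the shape of this objective concrete, here is a minimal Python sketch: a DOM-like page is linearized into a bracketed parse, and some of its text spans are masked in the version that would be rendered and screenshotted, while the decoder target remains the full parse. The bracketed format and the `<mask>` sentinel are illustrative assumptions, not the paper's actual simplified-HTML grammar or masking scheme.

```python
import random

def linearize(node):
    """Recursively flatten a {text, children} node into a bracketed parse string."""
    parts = []
    if node.get("text"):
        parts.append(node["text"])
    for child in node.get("children", []):
        parts.append(linearize(child))
    return "<" + " ".join(parts) + ">"

def mask_text_spans(node, rate=0.5, rng=random):
    """Hide a fraction of text spans (in the page that gets rendered to pixels);
    the decoder must still emit the full parse, giving a masked-LM-like signal."""
    if node.get("text") and rng.random() < rate:
        node = {**node, "text": "<mask>"}  # this text is hidden in the *image*
    node = {**node, "children": [mask_text_spans(c, rate, rng)
                                 for c in node.get("children", [])]}
    return node

page = {
    "tag": "div",
    "children": [
        {"tag": "h1", "text": "Example Domain"},
        {"tag": "p", "text": "This page is for illustration."},
    ],
}

random.seed(1)
visible = mask_text_spans(page)  # what would be rendered and screenshotted
print("rendered (masked) page:", linearize(visible))
print("decoder target (full parse):", linearize(page))
```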
Variable-Resolution Input Representation: Pix2Struct introduces a variable-resolution input representation for Vision Transformers that preserves the original aspect ratio of input images, improving robustness across documents and interfaces of widely varying shapes and resolutions.
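A rough NumPy sketch of the idea follows: rescale the screenshot so that the number of fixed-size patches fits a sequence-length budget while keeping the aspect ratio, then attach 2-D (row, column) position indices to each patch. The scaling rule, patch size, and budget below are assumptions for illustration rather than Pix2Struct's exact preprocessing.

```python
import math
import numpy as np

def variable_resolution_patches(image, patch_size=16, max_patches=2048):
    """Rescale an (H, W, C) image so at most `max_patches` patches fit, preserving
    aspect ratio; return flattened patches with their (row, col) indices."""
    h, w = image.shape[:2]
    # Pick a scale so rows * cols <= max_patches with the aspect ratio kept.
    scale = math.sqrt(max_patches * (patch_size / h) * (patch_size / w))
    rows = max(1, math.floor(scale * h / patch_size))
    cols = max(1, math.floor(scale * w / patch_size))
    new_h, new_w = rows * patch_size, cols * patch_size

    # Nearest-neighbour resize in pure NumPy to keep the sketch dependency-free.
    ys = (np.arange(new_h) * h / new_h).astype(int)
    xs = (np.arange(new_w) * w / new_w).astype(int)
    resized = image[ys][:, xs]

    # Cut into patch_size x patch_size patches and flatten each one.
    patches = resized.reshape(rows, patch_size, cols, patch_size, -1)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(rows * cols, -1)

    # 2-D positions let the transformer handle arbitrary aspect ratios.
    row_idx, col_idx = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    positions = np.stack([row_idx.ravel(), col_idx.ravel()], axis=1)
    return patches, positions

screenshot = np.random.randint(0, 255, size=(768, 1366, 3), dtype=np.uint8)
patches, positions = variable_resolution_patches(screenshot)
print(patches.shape, positions.shape)  # (N, patch_size*patch_size*3), (N, 2)
```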
The paper also shows that language and vision inputs can be integrated by rendering language prompts, such as questions, directly onto the input image. This pretraining equips Pix2Struct with rich representations that transfer to diverse downstream tasks in visual language understanding.
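A minimal sketch of this input scheme, assuming PIL for rendering: the prompt is drawn as a text header above the screenshot so that both modalities reach the model as pixels. The header layout and font here are illustrative choices, not the paper's exact rendering recipe.

```python
from PIL import Image, ImageDraw, ImageFont

def render_prompt_header(screenshot: Image.Image, prompt: str,
                         header_height: int = 40) -> Image.Image:
    """Draw `prompt` as a rendered text header above the screenshot, so that
    language and vision are consumed through the same pixel input."""
    font = ImageFont.load_default()
    canvas = Image.new("RGB", (screenshot.width,
                               screenshot.height + header_height), "white")
    draw = ImageDraw.Draw(canvas)
    draw.text((8, 8), prompt, fill="black", font=font)
    canvas.paste(screenshot, (0, header_height))
    return canvas

page = Image.new("RGB", (1024, 768), "lightgray")  # stand-in screenshot
combined = render_prompt_header(page, "What is the title of this page?")
combined.save("prompted_screenshot.png")
```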
Evaluation and Performance
Pix2Struct was evaluated across a spectrum of tasks classified into four main domains: documents, illustrations, user interfaces, and natural images. Remarkably, the model achieved state-of-the-art results in six out of nine tasks, underscoring its efficacy as a versatile visual language understanding framework.
The model's strength lies in low-resource domains such as illustrations and UIs, where it delivers significant improvements over existing methods. In high-resource domains, Pix2Struct did not surpass models built around domain-specific pipelines, but it remained competitive, suggesting that further gains are possible through scaling.
Theoretical and Practical Implications
Pix2Struct’s approach carries implications for both theoretical research and practical application. Theoretically, it suggests a shift in pretraining paradigms for visual language models, highlighting the utility of large-scale, web-derived visual data. Practically, the model’s ability to reason over diverse visual contexts without external OCR inputs reduces computational cost and engineering complexity.
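For a sense of what OCR-free inference looks like in practice, here is a hedged usage sketch assuming the Hugging Face transformers port of Pix2Struct and a DocVQA-finetuned checkpoint (google/pix2struct-docvqa-base); the image file name is a placeholder, and the library documentation should be consulted for exact usage.

```python
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

# Assumed checkpoint name; swap in any Pix2Struct checkpoint you have access to.
checkpoint = "google/pix2struct-docvqa-base"
processor = Pix2StructProcessor.from_pretrained(checkpoint)
model = Pix2StructForConditionalGeneration.from_pretrained(checkpoint)

image = Image.open("document.png")  # placeholder path to a document screenshot
# The processor renders the question onto the image; no external OCR is needed.
inputs = processor(images=image, text="What is the document title?",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(outputs[0], skip_special_tokens=True))
```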
This model opens avenues for future research to explore more sophisticated interaction models between visual elements and textual descriptions, potentially leading to more capable AI systems in tasks involving complex multimodal data.
Future Developments
The research underscores the potential gains from pretraining on expansive, rich visual datasets such as the web. Future work could benefit from progress in the efficiency and scalability of large vision transformers, as well as from more curated use of web data that leverages the dynamic and interactive content of modern web pages.
Ultimately, Pix2Struct serves as a promising step toward achieving more generalized and adaptable AI systems capable of understanding visually-situated language across an ever-expanding variety of contexts and applications.