Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding (2210.03347v2)

Published 7 Oct 2022 in cs.CL and cs.CV

Abstract: Visually-situated language is ubiquitous -- sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on domain-specific recipes with limited sharing of the underlying data, model architectures, and objectives. We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding, which can be finetuned on tasks containing visually-situated language. Pix2Struct is pretrained by learning to parse masked screenshots of web pages into simplified HTML. The web, with its richness of visual elements cleanly reflected in the HTML structure, provides a large source of pretraining data well suited to the diversity of downstream tasks. Intuitively, this objective subsumes common pretraining signals such as OCR, language modeling, and image captioning. In addition to the novel pretraining strategy, we introduce a variable-resolution input representation and a more flexible integration of language and vision inputs, where language prompts such as questions are rendered directly on top of the input image. For the first time, we show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains: documents, illustrations, user interfaces, and natural images.

The paper "Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding" introduces a novel approach to addressing the challenges posed by visually-situated language. Unlike domain-specific methods, this approach leverages the versatility of pretraining on web page screenshots to develop a generalized model capable of understanding a wide range of visual language contexts. The model, termed Pix2Struct, is pretrained to convert masked screenshots into simplified HTML, which is detailed in subsequent sections of the paper.

Key Contributions and Methodology

Pix2Struct addresses the fragmented nature of previous work in visually-situated language understanding by proposing a unified pretraining strategy. The model processes screenshots to predict HTML parses, a single objective that subsumes signals akin to OCR, language modeling, and image captioning while retaining the advantages of a unified architecture.

Pretraining Strategy: The model uses a screenshot parsing task that converts visual inputs to text by predicting the simplified HTML structure of a web page. The task is augmented with masked inputs, encouraging the model to infer missing information in a manner similar to masked language modeling.
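
As a toy illustration of what such a parsing target can look like, the sketch below linearizes a small DOM subtree into a compact HTML-like string that the model would learn to predict from the rendered screenshot. The node structure and tag names are assumptions made here for illustration; the paper defines its own simplified-HTML serialization and additionally masks regions of the input screenshot so that hidden content must be inferred from visual context.

```python
# Toy sketch of a screenshot-parsing target: serialize a DOM subtree into a
# compact HTML-like string. The Node class and tag names are illustrative
# assumptions, not the paper's exact serialization format.
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str                                  # e.g. "div", "img", "text"
    text: str = ""                            # visible text or alt text, if any
    children: list["Node"] = field(default_factory=list)

def linearize(node: Node) -> str:
    """Serialize a DOM subtree into a compact pretraining target string."""
    inner = node.text or " ".join(linearize(child) for child in node.children)
    return f"<{node.tag}> {inner} </{node.tag}>"

login_form = Node("div", children=[
    Node("img", text="company logo"),
    Node("text", text="Sign in"),
    Node("text", text="Forgot password?"),
])
print(linearize(login_form))
# <div> <img> company logo </img> <text> Sign in </text> <text> Forgot password? </text> </div>
```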

Variable-Resolution Input Representation: Pix2Struct introduces a variable-resolution input representation for Vision Transformers that preserves the original aspect ratio of each image, improving robustness across diverse document and interface types.
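
The core of this representation can be sketched as follows: rescale the screenshot so that its grid of fixed-size patches fits a sequence-length budget while keeping the original aspect ratio, then tag each patch with its 2-D (row, column) position. The patch size, budget, and helper name below are illustrative assumptions, not the released preprocessing code.

```python
# Hedged sketch: choose the largest rescaling whose patch grid fits a fixed
# sequence-length budget, preserving the aspect ratio of the input image.
import math

def variable_resolution_grid(height: int, width: int, patch: int = 16,
                             max_patches: int = 2048) -> tuple[int, int]:
    """Return the (rows, cols) patch grid for an image of size height x width."""
    # Pick scale s so that (height*s/patch) * (width*s/patch) <= max_patches.
    scale = math.sqrt(max_patches * (patch / height) * (patch / width))
    rows = max(1, math.floor(height * scale / patch))
    cols = max(1, math.floor(width * scale / patch))
    return rows, cols   # each patch is also tagged with its (row, col) position

rows, cols = variable_resolution_grid(900, 1600, max_patches=1024)
print(rows, cols, rows * cols)   # roughly 24 x 42 = 1008 patches, within budget
```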

The paper also demonstrates a more flexible integration of language and vision inputs: language prompts, such as questions, are rendered directly onto the input image. This pretraining equips Pix2Struct with rich representations that transfer to diverse downstream tasks in visual language understanding.
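
A minimal sketch of that integration, assuming a simple layout in which the question is drawn in a header strip above the screenshot before the combined image is fed to the model; the header size, font, and function name are hypothetical, and the released preprocessing may lay the prompt out differently.

```python
# Hedged sketch of rendering a textual prompt onto the input image so that the
# question and the page it refers to share a single visual input.
from PIL import Image, ImageDraw, ImageFont

def render_prompt(image: Image.Image, prompt: str, header_height: int = 40) -> Image.Image:
    """Return a new image with `prompt` drawn in a white header above `image`."""
    out = Image.new("RGB", (image.width, image.height + header_height), "white")
    out.paste(image, (0, header_height))
    draw = ImageDraw.Draw(out)
    draw.text((5, 5), prompt, fill="black", font=ImageFont.load_default())
    return out

screenshot = Image.new("RGB", (640, 480), "lightgray")   # stand-in screenshot
combined = render_prompt(screenshot, "What is the title of this page?")
combined.save("prompted_input.png")
```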

Evaluation and Performance

Pix2Struct was evaluated across a spectrum of tasks classified into four main domains: documents, illustrations, user interfaces, and natural images. Remarkably, the model achieved state-of-the-art results in six out of nine tasks, underscoring its efficacy as a versatile visual language understanding framework.

The model is strongest in low-resource domains such as illustrations and user interfaces, where it shows significant improvements over existing methods. In high-resource domains, Pix2Struct did not surpass models built on domain-specific pipelines, but it remained competitive, suggesting that further gains are possible through scaling.

Theoretical and Practical Implications

Pix2Struct’s approach presents several implications for both theoretical research and practical application. Theoretically, it suggests a shift in pretraining paradigms for visual language models, highlighting the utility of large-scale, web-derived visual data. Practically, the model’s ability to reason over diverse visual contexts without external OCR inputs reduces computational cost and engineering complexity.

This model opens avenues for future research to explore more sophisticated interaction models between visual elements and textual descriptions, potentially leading to more capable AI systems in tasks involving complex multimodal data.

Future Developments

The research underscores the potential gains from pretraining on expansive, rich visual datasets such as the web. Future work could benefit from advances in the efficiency and scalability of large vision transformers, as well as from more curated use of web data that leverages the dynamic and interactive content of modern web pages.

Ultimately, Pix2Struct serves as a promising step toward achieving more generalized and adaptable AI systems capable of understanding visually-situated language across an ever-expanding variety of contexts and applications.

Authors (10)
  1. Kenton Lee (40 papers)
  2. Mandar Joshi (24 papers)
  3. Iulia Turc (6 papers)
  4. Hexiang Hu (48 papers)
  5. Fangyu Liu (59 papers)
  6. Julian Eisenschlos (4 papers)
  7. Urvashi Khandelwal (12 papers)
  8. Peter Shaw (23 papers)
  9. Ming-Wei Chang (44 papers)
  10. Kristina Toutanova (31 papers)
Citations (217)