Improving Language Understanding from Screenshots (2402.14073v1)

Published 21 Feb 2024 in cs.CL, cs.LG, and cs.CV

Abstract: An emerging family of language models (LMs), capable of processing both text and images within a single visual view, has the promise to unlock complex tasks such as chart understanding and UI navigation. We refer to these models as screenshot language models. Despite their appeal, existing screenshot LMs substantially lag behind text-only models on language understanding tasks. To close this gap, we adopt a simplified setting where the model inputs are plain-text-rendered screenshots, and we focus on improving the text ability of screenshot LMs. We propose a novel Patch-and-Text Prediction (PTP) objective, which masks and recovers both image patches of screenshots and text within screenshots. We also conduct extensive ablation studies on masking rates and patch sizes, as well as designs for improving training stability. Our pre-trained model, while solely taking visual inputs, achieves comparable performance with BERT on 6 out of 8 GLUE tasks (within 2%) and improves up to 8% over prior work. Additionally, we extend PTP to train autoregressive screenshot LMs and demonstrate its effectiveness--our models can significantly reduce perplexity by utilizing the screenshot context. Together, we hope our findings can inspire future research on developing powerful screenshot LMs and extending their reach to broader applications.

Authors (4)
  1. Tianyu Gao (35 papers)
  2. Zirui Wang (83 papers)
  3. Adithya Bhaskar (9 papers)
  4. Danqi Chen (84 papers)

Summary

  • The paper introduces the Patch-and-Text Prediction objective to jointly recover masked image patches and text, enhancing language understanding in screenshot LMs.
  • It performs within 2% of BERT on 6 of 8 GLUE tasks and improves over prior screenshot LMs by up to 8%, despite taking only visual inputs.
  • Extensive ablation studies on masking rates and patch sizes show the importance of balancing image and text prediction, and an autoregressive extension of PTP reduces perplexity by exploiting screenshot context.

Improving Language Understanding in Screenshot-Based Language Models Through Patch-and-Text Prediction

Introduction to Screenshot Language Models

The development of language models (LMs) capable of processing both textual and visual inputs in a unified framework has opened new avenues for tasks that require complex understanding, such as document interpretation, chart reading, and user interface navigation. Screenshot language models (SLMs) represent a promising direction in this area, leveraging the rich information available in screenshots, which encompass text, images, charts, and tables. These models offer the potential to handle visually situated text in an end-to-end manner, bypassing the limitations of processing image and text data separately.

The Challenge: Language Understanding Gap

Despite the potential of SLMs, their performance on language understanding tasks significantly lags behind that of text-only LMs. This performance gap hinders the practical application of SLMs in scenarios where linguistic comprehension is crucial. Prior work has demonstrated the promise of SLMs in specific contexts, such as multilingual transfer, historical document understanding, and chart/UI interpretation. However, the inherent modality mismatch between visual inputs and textual outputs makes it challenging to process text within screenshots effectively.

Our Approach: Patch-and-Text Prediction (PTP)

To address the shortcomings in language understanding capabilities of SLMs, we introduce the Patch-and-Text Prediction (PTP) training objective. Unlike previous approaches that focus exclusively on either image patches or text prediction, the PTP objective concurrently targets the recovery of both masked image patches and text within screenshots. This dual-focused objective enables our model to learn local visual features of the text and derive language understanding from the visual representation, thus enhancing its linguistic capabilities.
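The dual objective is straightforward to express in code. The following is a minimal PyTorch-style sketch of how masked patch reconstruction and masked text prediction could be combined into a single loss; the model interface, mask shapes, and the weighting term alpha are illustrative assumptions rather than the paper's exact implementation.

```python
# Hypothetical sketch of a Patch-and-Text Prediction (PTP) style loss.
# The model interface, masking inputs, and loss weighting below are
# illustrative assumptions, not the paper's exact recipe.
import torch
import torch.nn.functional as F

def ptp_loss(model, screenshot_patches, text_ids, patch_mask, text_mask, alpha=1.0):
    """
    screenshot_patches: (B, N, P) flattened pixel patches of the rendered screenshot
    text_ids:           (B, T)    token ids of the text shown in the screenshot
    patch_mask:         (B, N)    True where an image patch is masked out
    text_mask:          (B, T)    True where a text token must be recovered
    """
    # Assumed model API: returns pixel predictions for every patch and
    # token logits for every text position.
    pred_patches, text_logits = model(
        screenshot_patches, text_ids, patch_mask=patch_mask, text_mask=text_mask
    )

    # 1) Patch prediction: MSE on the masked patches only (MAE-style).
    patch_loss = F.mse_loss(pred_patches[patch_mask], screenshot_patches[patch_mask])

    # 2) Text prediction: cross-entropy on the masked text positions only.
    text_loss = F.cross_entropy(text_logits[text_mask], text_ids[text_mask])

    # Combined objective; alpha balances the visual and textual terms.
    return patch_loss + alpha * text_loss
```

Here, alpha controls the balance between the patch and text terms, the kind of trade-off that the ablation studies on masking rates and patch sizes examine.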

Key Contributions and Findings

Our work presents several significant contributions:

  • The Patch-and-Text Prediction (PTP) objective substantially improves the language understanding performance of SLMs, achieving results within 2% of BERT on 6 of 8 GLUE tasks and exceeding prior screenshot LMs by up to 8%.
  • Extensive ablation studies on masking rates and patch sizes reveal the importance of balancing image and text prediction tasks to optimize performance.
  • The extension of PTP to autoregressive SLMs, incorporating a single decoder design, shows effectiveness in utilizing screenshot context to reduce perplexity on subsequent text.
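To make the last point concrete, the sketch below shows one way to compare perplexity on subsequent text with and without a screenshot prefix. The vision_encoder and text_decoder interfaces are hypothetical and serve only to illustrate the evaluation, not the paper's released API.

```python
# Hypothetical sketch of measuring perplexity with and without screenshot
# context for an autoregressive screenshot LM. The `vision_encoder` and
# `text_decoder` interfaces are assumptions for illustration only.
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(text_decoder, text_ids, prefix_embeds=None):
    """Perplexity of `text_ids`, optionally conditioned on screenshot embeddings."""
    logits = text_decoder(input_ids=text_ids, prefix_embeds=prefix_embeds)  # (B, T, V)
    # Shift so that position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = text_ids[:, 1:]
    nll = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
    return torch.exp(nll)

# Usage: compare perplexity of the continuation text with and without the
# rendered screenshot of the preceding context.
# screenshot_embeds = vision_encoder(screenshot_patches)                    # (B, N, D)
# ppl_with    = perplexity(text_decoder, continuation_ids, screenshot_embeds)
# ppl_without = perplexity(text_decoder, continuation_ids)
```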

Advancing Screenshot Language Models

The success of the PTP objective in enhancing the language understanding ability of screenshot LMs opens new possibilities for their application, beyond the realms traditionally dominated by text-only models. It narrows the performance gap in language understanding tasks, setting a foundation for more powerful and versatile SLMs capable of navigating the increasingly multimodal nature of digital information.

Future Directions and Speculations

While our work marks a significant step forward, the field of screenshot language models remains ripe for further exploration. Potential avenues for future research include incorporating real-world screenshots, improving the efficiency and stability of SLM training, and exploring novel applications unattainable by text-only LMs. The continuous evolution of SLMs promises to broaden their applicability, making them indispensable tools for navigating and interpreting the visual and textual fabric of the digital world.

Acknowledgements and Supporting Information

The development of the Patch-and-Text Prediction objective and the subsequent improvements in SLM performance are the result of collaborative efforts and valuable feedback from the research community. The open-source code and additional details on the implementation, training, and evaluation of our models are made available, encouraging further experimentation and development in this exciting field of research.