Improving Language Understanding from Screenshots (2402.14073v1)
Abstract: An emerging family of language models (LMs) that process both text and images within a single visual view promises to unlock complex tasks such as chart understanding and UI navigation. We refer to these models as screenshot LMs. Despite their appeal, existing screenshot LMs substantially lag behind text-only models on language understanding tasks. To close this gap, we adopt a simplified setting in which the model inputs are plain-text-rendered screenshots, and we focus on improving the text ability of screenshot LMs. We propose a novel Patch-and-Text Prediction (PTP) objective, which masks and recovers both image patches of screenshots and text within screenshots. We also conduct extensive ablation studies on masking rates and patch sizes, as well as designs for improving training stability. Our pre-trained model, while taking only visual inputs, achieves performance comparable to BERT on 6 out of 8 GLUE tasks (within 2%) and improves up to 8% over prior work. Additionally, we extend PTP to train autoregressive screenshot LMs and demonstrate its effectiveness: our models significantly reduce perplexity by utilizing the screenshot context. We hope our findings will inspire future research on developing powerful screenshot LMs and extending their reach to broader applications.
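The abstract describes PTP as a combination of masked-patch reconstruction (as in masked autoencoders) and masked-text prediction (as in masked language modeling). The following is a minimal illustrative sketch of such a combined loss, not the paper's actual implementation: the function name `ptp_loss`, the mixing weight `alpha`, and the mean-squared-error/cross-entropy pairing are assumptions chosen for clarity.

```python
import numpy as np

def ptp_loss(patches, text_ids, pred_patches, pred_logits,
             patch_mask, text_mask, alpha=0.5):
    """Sketch of a Patch-and-Text Prediction style loss.

    patches:      (N, D) ground-truth image patches
    text_ids:     (T,)   ground-truth token ids
    pred_patches: (N, D) model patch reconstructions
    pred_logits:  (T, V) model token logits
    patch_mask:   (N,)   bool, True where a patch was masked out
    text_mask:    (T,)   bool, True where a token was masked out
    alpha:        weight balancing the two loss terms (assumed)
    """
    # MAE-style pixel regression, computed on masked patches only.
    patch_loss = np.mean((pred_patches[patch_mask] - patches[patch_mask]) ** 2)

    # MLM-style cross-entropy, computed on masked tokens only.
    logits = pred_logits[text_mask]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    text_loss = -np.mean(log_probs[np.arange(text_mask.sum()),
                                   text_ids[text_mask]])

    return alpha * patch_loss + (1 - alpha) * text_loss
```

In a real training loop the masks would be sampled each step at the masking rates the paper ablates, and both terms would be backpropagated through a shared vision-language encoder.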
Authors: Tianyu Gao, Zirui Wang, Adithya Bhaskar, Danqi Chen