Enhancing Vision-Language Pre-training with Rich Supervisions (2403.03346v1)

Published 5 Mar 2024 in cs.CV

Abstract: We propose Strongly Supervised pre-training with ScreenShots (S4) - a novel pre-training paradigm for Vision-Language Models using data from large-scale web screenshot rendering. Using web screenshots unlocks a treasure trove of visual and textual cues that are not present in image-text pairs. In S4, we leverage the inherent tree-structured hierarchy of HTML elements and their spatial localization to carefully design 10 pre-training tasks with large-scale annotated data. These tasks resemble downstream tasks across different domains, and the annotations are cheap to obtain. We demonstrate that, compared to current screenshot pre-training objectives, our innovative pre-training method significantly enhances the performance of image-to-text models on nine varied and popular downstream tasks - up to 76.1% improvement on Table Detection, and at least 1% on Widget Captioning.

Enhancing Vision-Language Pre-training with Rich Supervisions from Web Screen Data

Introduction

In the rapidly evolving field of Vision-Language Models (VLMs), leveraging diverse and enriched datasets for pre-training can significantly amplify a model's understanding and interpretation of multimodal inputs. The recent work "Enhancing Vision-Language Pre-training with Rich Supervisions" takes a novel approach by utilizing large-scale web screenshot rendering to extract a combination of visual and textual cues. This method taps into the inherently structured nature of web content, facilitating the design of ten distinct pre-training tasks. These tasks are not only relevant to real-world applications but also benefit from low-cost, automatically generated annotations.

Dataset and Pre-training Paradigm

The foundation of this approach is a richly annotated dataset derived from rendering web pages into screenshots, while also capturing textual content, spatial localization, and hierarchical relationships among HTML elements. This is achieved by an efficient rendering and supervision extraction pipeline, followed by meticulous data cleaning to ensure quality. The result is a dataset comprising 15 million unique and high-quality vision-language pairs.
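For intuition, a minimal supervision-extraction step might look like the sketch below, which renders a page with a headless browser and records text and bounding boxes for a few common element types alongside the screenshot. This is an illustrative approximation assuming Playwright and Chromium, not the authors' pipeline; the element selectors, viewport size, and output format are arbitrary choices made for the example.

import json
from playwright.sync_api import sync_playwright

def render_with_supervision(url: str, out_prefix: str) -> None:
    # Render the page and capture a screenshot plus simple element-level labels.
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 1280})
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=f"{out_prefix}.png")

        annotations = []
        # Illustrative element types only; the paper derives richer supervision
        # from the full HTML tree.
        for el in page.query_selector_all("a, button, img, table, h1, h2"):
            box = el.bounding_box()  # None if the element is not rendered
            if box is None:
                continue
            annotations.append({
                "tag": el.evaluate("e => e.tagName.toLowerCase()"),
                "text": el.inner_text()[:200],
                "bbox": [box["x"], box["y"], box["width"], box["height"]],
            })
        browser.close()

    with open(f"{out_prefix}.json", "w") as f:
        json.dump(annotations, f, indent=2)

# Example usage:
# render_with_supervision("https://example.com", "sample_0000")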

Building on this dataset, the proposed pre-training paradigm, Strongly Supervised pre-training with ScreenShots (S4), comprises a suite of ten carefully curated tasks. These tasks leverage the rich supervisory signals embedded in the dataset, ranging from Optical Character Recognition (OCR) and Image Grounding to more complex objectives such as Table Detection and Screen Titling. By aligning closely with downstream applications, they collectively aim to enhance the pre-trained model's adaptability and performance across a variety of vision-language domains.
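To make the image-to-text formulation concrete, the snippet below sketches how a few such tasks could be expressed as prompt/target text pairs over one annotated screenshot. The task markers, templates, and sample data are illustrative assumptions, not the exact prompts used in S4.

def make_examples(annotation):
    # Turn one annotated screenshot into (prompt, target) text pairs.
    examples = []

    # OCR-style task: read out the visible text.
    examples.append(("<ocr>", annotation["page_text"]))

    # Grounding-style task: locate the element matching a text query.
    for el in annotation["elements"]:
        x, y, w, h = el["bbox"]
        examples.append((f"<ground> {el['text']}", f"<box> {x} {y} {w} {h}"))

    # Screen-titling-style task: summarize the page in one line.
    examples.append(("<title>", annotation["title"]))
    return examples

sample = {
    "page_text": "Welcome to Example Corp. Products Pricing Contact",
    "title": "Example Corp homepage",
    "elements": [{"text": "Pricing", "bbox": [412, 36, 80, 24]}],
}
for prompt, target in make_examples(sample):
    print(prompt, "->", target)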

Architectural Considerations

The architecture employed in this work adheres to a straightforward design, featuring an image encoder followed by a text decoder. This configuration, akin to models like Pix2Struct and Donut, facilitates the direct processing of images to generate text outputs. However, unique to this work is the extension of the model's vocabulary to include coordinate tokens, enabling the model to handle tasks requiring spatial localization without the need for OCR inputs, thus reducing latency and memory usage.
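A common way to realize such coordinate tokens is to quantize box coordinates into a fixed number of bins and add one token per bin to the decoder vocabulary, so the model can emit locations as ordinary text. The sketch below does this with the Hugging Face transformers API; the bin count, token naming, and the T5 stand-in for the image-encoder/text-decoder model are assumptions for illustration, not the paper's exact setup.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

NUM_BINS = 1000  # assumed quantization granularity, not the paper's exact value

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# One new token per coordinate bin, e.g. <loc_0> ... <loc_999>.
coord_tokens = [f"<loc_{i}>" for i in range(NUM_BINS)]
tokenizer.add_tokens(coord_tokens, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

def box_to_tokens(box, img_w, img_h):
    # Quantize an (x1, y1, x2, y2) box into discrete coordinate tokens.
    x1, y1, x2, y2 = box
    bins = [
        int(x1 / img_w * (NUM_BINS - 1)),
        int(y1 / img_h * (NUM_BINS - 1)),
        int(x2 / img_w * (NUM_BINS - 1)),
        int(y2 / img_h * (NUM_BINS - 1)),
    ]
    return " ".join(f"<loc_{b}>" for b in bins)

print(box_to_tokens((64, 128, 512, 320), img_w=1280, img_h=1280))
# -> <loc_49> <loc_99> <loc_399> <loc_249>

Because the decoder produces these tokens like any other text, localization outputs need no separate detection head or OCR input at inference time.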

Empirical Evaluation

The efficacy of the S4 pre-training paradigm is rigorously evaluated across nine downstream tasks, encompassing areas such as Chart and Web Understanding, UI Summarization, and Widget Captioning. The results are compelling, showing significant performance improvements on all tasks compared to baselines pre-trained without the rich supervisory signals proposed in this work. Particularly notable are the gains on tasks requiring spatial localization, such as Table Detection and Referring Expression Comprehension, where improvements of up to 76.1% are reported.

Implications and Future Directions

This work underscores the potential of leveraging web-rendered data for vision-language model pre-training. By exploiting the structured nature of web content, the S4 pre-training paradigm unlocks new possibilities for enriching the supervisory signals available during the pre-training phase. The remarkable performance uplift observed across a diverse set of downstream tasks highlights the value of this approach.

Looking ahead, the continual expansion of web crawl corpora and advancements in rendering technologies promise even richer datasets for pre-training. Furthermore, adapting the S4 paradigm to newer model architectures and extending it with emerging pre-training tasks hold great potential for pushing the boundaries of what is achievable with Vision-Language Models.

In conclusion, "Enhancing Vision-Language Pre-training with Rich Supervisions" presents a compelling case for the strategic utilization of web data for model pre-training. The proposed methodology not only sets new performance benchmarks across a range of tasks but also paves the way for future innovations in the field of generative AI and LLMs.

References (75)
  1. Docformer: End-to-end transformer for document understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 993–1003, 2021.
  2. Docformerv2: Local features for document understanding. AAAI, abs/2306.01733, 2024.
  3. Mlim: Vision-and-language model pre-training with masked language and image modeling. arXiv preprint arXiv:2109.12178, 2021.
  4. Uibert: Learning generic multimodal representations for ui understanding. arXiv preprint arXiv:2107.13731, 2021.
  5. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
  6. Latr: Layout-aware transformer for scene-text vqa. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16548–16558, 2022.
  7. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  8. Vlp: A survey on vision-language pre-training. Machine Intelligence Research, 20:38–56, 2022a.
  9. Websrc: A dataset for web-based structural reading comprehension, 2021.
  10. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022b.
  11. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
  12. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
  13. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  14. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  15. Coarse-to-fine vision-language pre-training with fusion in the backbone. ArXiv, abs/2206.07643, 2022.
  16. A survey of vision-language pre-trained models. In International Joint Conference on Artificial Intelligence, 2022.
  17. Icdar 2019 competition on table detection and recognition (ctdar). In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1510–1515, 2019.
  18. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
  19. Clip-adapter: Better vision-language models with feature adapters. ArXiv, abs/2110.04544, 2021.
  20. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
  21. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022.
  22. Scaling up vision-language pretraining for image captioning. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17959–17968, 2021.
  23. Seeing out of the box: End-to-end pre-training for vision-language representation learning. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12971–12980, 2021.
  24. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, 2021.
  25. A good prompt is worth millions of parameters: Low-resource prompt-based learning for vision-language models. In Annual Meeting of the Association for Computational Linguistics, 2021.
  26. Mdetr - modulated detection for end-to-end multi-modal understanding. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 1760–1770, 2021.
  27. Ocr-free document understanding transformer, 2022.
  28. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  29. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision, 128(7):1956–1981, 2020.
  30. Masked vision and language modeling for multi-modal representation learning. arXiv preprint arXiv:2208.02131, 2022.
  31. Learning instance occlusion for panoptic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10720–10729, 2020.
  32. Pix2struct: Screenshot parsing as pretraining for visual language understanding, 2023.
  33. Spotlight: Mobile ui understanding using vision-language models with a focus. 2023.
  34. Markuplm: Pre-training of text and markup language for visually-rich document understanding. 2021a.
  35. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, 2022.
  36. Grounded language-image pre-training. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10955–10965, 2021b.
  37. Oscar: Object-semantics aligned pre-training for vision-language tasks. ECCV 2020, 2020a.
  38. Widget captioning: Generating natural language description for mobile user interface elements, 2020b.
  39. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  40. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Neural Information Processing Systems, 2019.
  41. Unified-io: A unified model for vision, language, and multi-modal tasks. ArXiv, abs/2206.08916, 2022.
  42. Chartqa: A benchmark for question answering about charts with visual and logical reasoning, 2022.
  43. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  44. Improving language understanding by generative pre-training. 2018.
  45. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021a.
  46. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021b.
  47. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints, 2019.
  48. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  49. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  50. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  51. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8430–8439, 2019.
  52. Test-time prompt tuning for zero-shot generalization in vision-language models. ArXiv, abs/2209.07511, 2022.
  53. PubTables-1M: Towards comprehensive table extraction from unstructured documents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4634–4642, 2022.
  54. Screen2words: Automatic mobile ui summarization with multimodal learning, 2021a.
  55. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, 2022a.
  56. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR, 2022b.
  57. Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. ArXiv, abs/2111.02358, 2021b.
  58. Image as a foreign language: Beit pretraining for all vision and vision-language tasks. ArXiv, abs/2208.10442, 2022c.
  59. Simmim: A simple framework for masked image modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9653–9663, 2022.
  60. E2e-vlp: End-to-end vision-language pre-training enhanced by visual learning. ArXiv, abs/2106.01804, 2021.
  61. Vision-language pre-training with triple contrastive learning. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15650–15659, 2022.
  62. Causal attention for vision-language tasks. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9842–9852, 2021.
  63. Cpt: Colorful prompt tuning for pre-trained vision-language models. ArXiv, abs/2109.11797, 2021.
  64. Socratic models: Composing zero-shot multimodal reasoning with language. ArXiv, abs/2204.00598, 2022.
  65. Multi-grained vision language pre-training: Aligning texts with visual concepts. ArXiv, abs/2111.08276, 2021.
  66. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. ArXiv, abs/2203.03605, 2022a.
  67. Glipv2: Unifying localization and vision-language understanding. ArXiv, abs/2206.05836, 2022b.
  68. Vinvl: Making visual representations matter in vision-language models. ArXiv, abs/2101.00529, 2021a.
  69. Vinvl: Making visual representations matter in vision-language models. CVPR 2021, 2021b.
  70. Musketeer (all for one, and one for all): A generalist vision-language model with task explanation prompts, 2023.
  71. Publaynet: largest dataset ever for document layout analysis. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1015–1022. IEEE, 2019.
  72. Learning to prompt for vision-language models. International Journal of Computer Vision, 130:2337 – 2348, 2021.
  73. Conditional prompt learning for vision-language models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16795–16804, 2022.
  74. Unified vision-language pre-training for image captioning and vqa. ArXiv, abs/1909.11059, 2019.
  75. Kaleido-bert: Vision-language pre-training on fashion domain. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12642–12652, 2021.
Authors (10)
  1. Yuan Gao (335 papers)
  2. Kunyu Shi (4 papers)
  3. Pengkai Zhu (9 papers)
  4. Edouard Belval (2 papers)
  5. Oren Nuriel (8 papers)
  6. Srikar Appalaraju (21 papers)
  7. Shabnam Ghadar (2 papers)
  8. Vijay Mahadevan (16 papers)
  9. Zhuowen Tu (80 papers)
  10. Stefano Soatto (179 papers)
Citations (9)