Enhancing Vision-Language Pre-training with Rich Supervisions from Web Screen Data
Introduction
In the rapidly evolving field of Vision-Language Models (VLMs), pre-training on diverse, richly annotated datasets can significantly improve a model's ability to understand and interpret multimodal inputs. The recent work "Enhancing Vision-Language Pre-training with Rich Supervisions" takes a novel approach: it renders web pages into screenshots at large scale and extracts a combination of visual and textual cues from them. Because web content is inherently structured, this setup makes it possible to design ten distinct pre-training tasks that are relevant to real-world applications and whose annotations are generated automatically at low cost.
Dataset and Pre-training Paradigm
The foundation of this approach is a richly annotated dataset derived from rendering web pages into screenshots while capturing the textual content, spatial location, and hierarchical relationships of HTML elements. An efficient rendering and supervision-extraction pipeline, followed by careful data cleaning, yields a dataset of 15 million unique, high-quality vision-language pairs.
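While the paper's exact pipeline is not reproduced here, the sketch below illustrates the general idea using Playwright as an assumed rendering tool: a page is rendered into a full-page screenshot while the text, tag name, and bounding box of each visible element are saved alongside it. The function name and output schema are illustrative, not the paper's.

```python
# Sketch: render a page, capture a screenshot, and record text + bounding boxes
# for visible elements. Playwright is an assumed tool; the actual pipeline and
# output schema in the paper may differ.
import json
from playwright.sync_api import sync_playwright

def render_page(url: str, out_prefix: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 1280})
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=f"{out_prefix}.png", full_page=True)

        # Collect text, tag name, and bounding box for every visible element
        # that carries text (a coarse filter; real cleaning would be stricter).
        elements = page.evaluate(
            """() => Array.from(document.querySelectorAll('*'))
                .map(el => {
                    const r = el.getBoundingClientRect();
                    return {tag: el.tagName, text: el.innerText || '',
                            x: r.x, y: r.y, w: r.width, h: r.height};
                })
                .filter(e => e.w > 0 && e.h > 0 && e.text.trim().length > 0)"""
        )
        with open(f"{out_prefix}.json", "w") as f:
            json.dump(elements, f)
        browser.close()
```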
Building on this dataset, the authors introduce a pre-training paradigm, Strongly Supervised pre-training with ScreenShots (S4), consisting of ten carefully curated tasks. These tasks exploit the rich supervisory signals embedded in the dataset, ranging from Optical Character Recognition (OCR) and Image Grounding to more complex tasks such as Table Detection and Screen Titling. Because they align closely with downstream applications, the tasks collectively aim to improve the pre-trained model's adaptability and performance across a variety of vision-language domains.
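To make the task formulation concrete, the hypothetical snippet below shows how two kinds of supervision (OCR and element grounding) could be serialized into (prompt, target) text pairs for an image-to-text model, reusing the element records from the earlier sketch. The prompt wording and the 1000-bin coordinate quantization are assumptions for illustration, not the paper's exact format.

```python
# Illustrative: turn extracted element annotations into (prompt, target) text
# pairs for two tasks. Prompt strings and the coordinate binning are sketch
# assumptions, not the paper's serialization.
def quantize(v: float, size: float, bins: int = 1000) -> int:
    """Map a pixel coordinate into one of `bins` discrete values."""
    return min(bins - 1, max(0, int(v / size * bins)))

def ocr_example(elements):
    # OCR: read out the visible text of the screenshot in document order.
    target = " ".join(e["text"].strip() for e in elements)
    return "perform OCR on the screenshot", target

def element_grounding_example(element, img_w, img_h):
    # Element grounding: given an element's text, predict its bounding box
    # as a sequence of coordinate tokens <x1><y1><x2><y2>.
    x1 = quantize(element["x"], img_w)
    y1 = quantize(element["y"], img_h)
    x2 = quantize(element["x"] + element["w"], img_w)
    y2 = quantize(element["y"] + element["h"], img_h)
    prompt = f"locate the element with text: {element['text']}"
    target = f"<{x1}><{y1}><{x2}><{y2}>"
    return prompt, target
```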
Architectural Considerations
The architecture follows a straightforward design: an image encoder followed by a text decoder, similar to models such as Pix2Struct and Donut, so screenshots are processed directly into text outputs without an external OCR stage, which keeps latency and memory usage low. What is unique to this work is the extension of the model's vocabulary with coordinate tokens, which lets tasks requiring spatial localization be handled by generating bounding-box coordinates directly as text.
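As a rough illustration of that vocabulary extension, the sketch below adds discrete coordinate tokens to a Donut-style encoder-decoder from Hugging Face Transformers. The base checkpoint and the 1000-token format are assumptions for the example, not the configuration used in the paper.

```python
# Sketch: extend a vision-encoder/text-decoder model's vocabulary with
# coordinate tokens so bounding boxes can be emitted directly as text.
# The checkpoint and token format are illustrative assumptions.
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# One token per quantized coordinate value, e.g. <0> ... <999>.
coordinate_tokens = [f"<{i}>" for i in range(1000)]
processor.tokenizer.add_tokens(coordinate_tokens)

# Grow the decoder's embedding matrix to cover the newly added tokens.
model.decoder.resize_token_embeddings(len(processor.tokenizer))
```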
Empirical Evaluation
The efficacy of the S4 pre-training paradigm is evaluated across nine downstream tasks, covering areas such as Chart and Web Understanding, UI Summarization, and Widget Captioning. The results show consistent performance improvements on all tasks compared to baselines pre-trained without the proposed supervisory signals. The gains are particularly pronounced on tasks requiring spatial localization, such as Table Detection and Referring Expression Comprehension, where improvements of up to 76.1% are reported.
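For these localization-style tasks, a prediction is typically judged by how well its bounding box overlaps the ground truth; a common scoring choice (assumed here, not necessarily the paper's exact protocol) is intersection-over-union with a fixed threshold, as in the small helper below.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A predicted box is usually counted as correct when, e.g., iou(pred, gt) >= 0.5.
```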
Implications and Future Directions
This work underscores the potential of leveraging web-rendered data for vision-language model pre-training. By exploiting the structured nature of web content, the S4 paradigm enriches the supervisory signals available during pre-training, and the performance uplift observed across a diverse set of downstream tasks highlights the value of this approach.
Looking ahead, the continual expansion of web crawl corpora and advances in rendering technology promise even richer pre-training datasets. Adapting the S4 paradigm to newer model architectures and extending it with additional pre-training tasks hold further potential for pushing the boundaries of what vision-language models can achieve.
In conclusion, "Enhancing Vision-Language Pre-training with Rich Supervisions" presents a compelling case for the strategic use of web data in model pre-training. The proposed methodology not only sets new performance benchmarks across a range of tasks but also paves the way for future work on generative vision-language models.