Enhancing Vision-Language Pre-training with Rich Supervisions
Abstract: We propose Strongly Supervised pre-training with ScreenShots (S4), a novel pre-training paradigm for vision-language models that uses data from large-scale web screenshot rendering. Web screenshots unlock a treasure trove of visual and textual cues that are absent from plain image-text pairs. In S4, we leverage the inherent tree-structured hierarchy of HTML elements and their spatial localization to carefully design 10 pre-training tasks with large-scale annotated data. These tasks resemble downstream tasks across different domains, and the annotations are cheap to obtain. We demonstrate that, compared to current screenshot pre-training objectives, our method significantly enhances the performance of image-to-text models on nine varied and popular downstream tasks, with improvements of up to 76.1% on Table Detection and at least 1% on Widget Captioning.
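To make the supervision source concrete, below is a minimal sketch of how cheap (text, bounding box, tag) annotations can be harvested from a rendered web page, in the spirit of S4's use of the HTML element tree and its spatial localization. It assumes Playwright for rendering; the selector list, record format, and example URL are illustrative assumptions, not the paper's actual pipeline or task definitions.

```python
# Minimal sketch: harvest screenshot-level supervision from rendered HTML.
# Assumes Playwright (pip install playwright; playwright install chromium).
# The selectors and record schema below are hypothetical, chosen only to
# illustrate that the DOM yields text + location annotations essentially free.
from playwright.sync_api import sync_playwright

def harvest_annotations(url: str, out_image: str):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 1280})
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=out_image)  # the model's visual input

        records = []
        # Walk a set of visible elements; the DOM tree provides both the
        # text content and its rendered bounding box for each node.
        for el in page.query_selector_all("a, button, h1, h2, p, td, th"):
            box = el.bounding_box()  # {'x','y','width','height'} or None if hidden
            text = el.inner_text().strip()
            if box and text:
                records.append({
                    "text": text,
                    "bbox": [box["x"], box["y"],
                             box["x"] + box["width"], box["y"] + box["height"]],
                    "tag": el.evaluate("e => e.tagName.toLowerCase()"),
                })
        browser.close()
        return records

if __name__ == "__main__":
    # Each record pairs on-screen text with its location, the kind of
    # cheap annotation from which localization- and captioning-style
    # pre-training tasks can be derived.
    for r in harvest_annotations("https://example.com", "screenshot.png")[:5]:
        print(r)
```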