
On Web-based Visual Corpus Construction for Visual Document Understanding (2211.03256v2)

Published 7 Nov 2022 in cs.CV, cs.AI, and cs.LG

Abstract: In recent years, research on visual document understanding (VDU) has grown significantly, with a particular emphasis on the development of self-supervised learning methods. However, one of the significant challenges faced in this field is the limited availability of publicly accessible visual corpora or extensive collections of images with detailed text annotations, particularly for non-Latin or resource-scarce languages. To address this challenge, we propose Web-based Visual Corpus Builder (Webvicob), a dataset generator engine capable of constructing large-scale, multilingual visual corpora from raw Wikipedia HTML dumps. Our experiments demonstrate that the data generated by Webvicob can be used to train robust VDU models that perform well on various downstream tasks, such as DocVQA and post-OCR parsing. Furthermore, when using a dataset of 1 million images generated by Webvicob, we observed an improvement of over 13% on the DocVQA Task 3 compared to a dataset of 11 million images from the IIT-CDIP. The implementation of our engine is publicly available on https://github.com/clovaai/webvicob

Summary

  • The paper introduces Webvicob, a robust engine that generates large-scale multilingual visual corpora from Wikipedia HTML dumps without relying on OCR.
  • It leverages precise hierarchical text annotations and rich font diversity to boost VDU performance, reporting an improvement of over 13% on DocVQA Task 3 with only 1 million pretraining images.
  • The approach democratizes access to multilingual datasets and improves the pretraining efficiency of state-of-the-art backbones such as BROS and LayoutXLM, especially for low-resource languages.

Webvicob: Building Multilingual Visual Corpora for Enhanced VDU

The advancement of Visual Document Understanding (VDU) has been marked by the development of self-supervised learning methods leveraging large datasets. However, the paucity of publicly accessible and well-annotated visual corpora, particularly for non-Latin and low-resource languages, remains a critical bottleneck. This paper introduces Webvicob, an innovative dataset generator designed to construct multilingual visual corpora from raw Wikipedia HTML dumps, addressing the aforementioned limitations.

Core Contributions

The principal contribution of this work is Webvicob, a robust engine for generating large-scale, multilingual visual corpora. The engine bypasses OCR for text extraction, which is often costly and limited in accuracy, particularly for non-Latin scripts. Instead, Webvicob captures text annotations directly from rendered Wikipedia HTML pages, ensuring high fidelity and coverage across 270 languages. The paper reports substantial improvements on downstream VDU tasks such as DocVQA and post-OCR parsing when models are pretrained on Webvicob-generated data. For example, a model pretrained on 1 million Webvicob images improved by over 13% on DocVQA Task 3 compared to one pretrained on the 11 million images of IIT-CDIP.
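
To make the OCR-free annotation concrete, the sketch below shows one plausible shape for a hierarchical record, with nested paragraph, line, word, and character boxes. The field names and layout are illustrative assumptions only; the actual schema is defined in the public repository.

```python
# Hypothetical Webvicob-style annotation record (illustrative only; the
# repository's real schema may differ). Boxes are [x, y, width, height]
# in rendered-image pixels.
annotation = {
    "image": "enwiki_000001.png",   # hypothetical file name
    "lang": "en",
    "paragraphs": [{
        "bbox": [32, 40, 960, 120],
        "lines": [{
            "bbox": [32, 40, 960, 28],
            "words": [{
                "text": "Webvicob",
                "bbox": [32, 40, 104, 28],
                # character-level boxes nest inside the word box
                "chars": [{"text": "W", "bbox": [32, 40, 16, 28]}],
            }],
        }],
    }],
}
```

Nested annotations of this kind give pretraining objectives access to every granularity, from character boxes for glyph-level tasks up to paragraph boxes for layout modeling.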

Methodological Approach

Webvicob leverages Wikipedia HTML dumps to produce hierarchical text annotations, capturing character, word, line, and paragraph-level details. The dataset also incorporates rich font diversity, which enhances the visual variance and robustness of the trained models. The methodology involves three key processes:

  1. Rendering webpage content into images with precise text bounding boxes.
  2. Removing non-informative elements to ensure clean and usable annotations.
  3. Generating accurate glyph bounding boxes using DOM APIs and Pygame for font rendering.
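
The steps above can be pictured with a short sketch. The code below is a minimal illustration, assuming Playwright for headless rendering and standard DOM APIs for word-level boxes; it is not the repository's actual implementation, and the URL and CSS selector are placeholders.

```python
from playwright.sync_api import sync_playwright

WIKI_URL = "https://en.wikipedia.org/wiki/Document"  # placeholder page

# JavaScript executed inside the page: walk every text node and collect a
# word-level bounding box via Range.getBoundingClientRect().
JS_WORD_BOXES = """
() => {
  const boxes = [];
  const walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT);
  while (walker.nextNode()) {
    const node = walker.currentNode;
    const re = /\\S+/g;
    let m;
    while ((m = re.exec(node.textContent)) !== null) {
      const range = document.createRange();
      range.setStart(node, m.index);
      range.setEnd(node, m.index + m[0].length);
      const r = range.getBoundingClientRect();
      if (r.width > 0 && r.height > 0)
        boxes.push({ word: m[0], x: r.x, y: r.y, w: r.width, h: r.height });
    }
  }
  return boxes;
}
"""

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page(viewport={"width": 1280, "height": 2000})
    page.goto(WIKI_URL)
    # Step 2: strip non-informative elements (the selector is an assumption).
    page.evaluate("document.querySelectorAll('.mw-editsection').forEach(e => e.remove())")
    words = page.evaluate(JS_WORD_BOXES)               # step 1: word boxes
    page.screenshot(path="page.png", full_page=True)   # step 1: rendered image
    browser.close()
```

For step 3, Pygame's font module exposes per-glyph metrics that can refine word boxes down to character level; `Font.metrics` returns `(minx, maxx, miny, maxy, advance)` for each glyph:

```python
import pygame

pygame.font.init()
# Assumption: a TTF covering the target script is available locally.
font = pygame.font.Font("NotoSans-Regular.ttf", 24)
word = "Webvicob"
x = 0
for ch, (minx, maxx, miny, maxy, advance) in zip(word, font.metrics(word)):
    # Approximate per-character box, relative to the word's baseline origin.
    print(ch, (x + minx, -maxy, maxx - minx, maxy - miny))
    x += advance
```

Note that DOM coordinates are in CSS pixels relative to the viewport, so a real pipeline must account for scrolling and device pixel ratio when aligning boxes with the screenshot.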

Experimental Evaluation

The paper details extensive experiments illustrating the efficacy of Webvicob-generated data. BROS, a state-of-the-art VDU backbone, pretrained on Webvicob data, outperforms its counterparts on several VDU benchmarks. Similarly, LayoutXLM, a multilingual backbone, shows competitive performance when pretrained on Webvicob data across low-resource language tasks. Notably, the performance gains are achieved with fewer training iterations compared to using traditional datasets, indicating enhanced data efficiency.

Implications and Future Directions

The success of Webvicob in creating diverse, high-quality visual corpora lays the groundwork for future advances in VDU. By removing the dependency on OCR tools, Webvicob provides a scalable and cost-effective solution that is particularly well suited to low-resource languages. Its deployment can substantially democratize access to multilingual datasets, fostering broader research on and application of VDU technologies.

Future research can focus on expanding Webvicob's dataset coverage by incorporating broader web repositories, such as CommonCrawl, and integrating robust augmentation strategies to further enhance dataset variability. Additionally, exploring the synergies between Webvicob-generated datasets and emerging transformer-based architectures without text dependencies, such as Vision Transformers, may unlock new paradigms in document understanding tasks.

In summary, Webvicob addresses a critical gap in VDU by providing a scalable means of generating high-quality, multilingual visual corpora from widely available web resources. The documented improvements in VDU model performance underscore the potential of Webvicob to catalyze advancements in document intelligence technologies, particularly in resource-constrained settings.
