- The paper introduces Webvicob, a robust engine that generates large-scale multilingual visual corpora from Wikipedia HTML dumps without relying on OCR.
- It provides precise hierarchical text annotations and rich font diversity, and pretraining on just 1 million Webvicob images yields an improvement of over 13% on DocVQA Task 3 compared to pretraining on the 11-million-image IIT-CDIP collection.
- With coverage of 270 languages, the approach broadens access to multilingual visual corpora and improves the data efficiency of state-of-the-art backbones such as BROS and LayoutXLM, especially for low-resource languages.
Webvicob: Building Multilingual Visual Corpora for Enhanced VDU
The advancement of Visual Document Understanding (VDU) has been driven by self-supervised learning methods that rely on large datasets. However, the scarcity of publicly accessible, well-annotated visual corpora, particularly for non-Latin and low-resource languages, remains a critical bottleneck. This paper introduces Webvicob, a dataset generator that constructs multilingual visual corpora from raw Wikipedia HTML dumps to address this gap.
Core Contributions
The principal contribution of this work is Webvicob, a robust engine capable of generating large-scale, multilingual visual corpora. It bypasses OCR for text extraction, since OCR engines are often costly and limited in accuracy, particularly on non-Latin scripts; instead, Webvicob captures text annotations directly from rendered Wikipedia HTML pages, ensuring high fidelity and comprehensive coverage across 270 languages. The paper reports substantial improvements in downstream VDU tasks such as DocVQA and post-OCR parsing when models are pretrained on Webvicob-generated data. For example, a model pretrained on 1 million Webvicob images improved by over 13% on DocVQA Task 3 compared to a counterpart pretrained on IIT-CDIP, which contains 11 million images.
Methodological Approach
Webvicob leverages Wikipedia HTML dumps to produce hierarchical text annotations, capturing character-, word-, line-, and paragraph-level details. The dataset also incorporates rich font diversity, which increases visual variance and makes trained models more robust. The methodology involves three key processes (sketched in code after the list):
- Rendering webpage content into images with precise text bounding boxes.
- Removing non-informative elements to ensure clean and usable annotations.
- Generating accurate glyph bounding boxes using DOM APIs and Pygame for font rendering.
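To make the first two processes concrete, the sketch below renders a page in a headless browser, strips non-informative elements, and reads word-level bounding boxes straight from the DOM Range API, with no OCR involved. The paper only specifies that DOM APIs are used, so the Selenium/Chrome harness, the helper name `render_and_annotate`, and the selectors for non-informative elements are assumptions for illustration, not the authors' released pipeline.

```python
# Illustrative sketch (not the authors' code): render an HTML page and
# collect word-level bounding boxes directly from the DOM, OCR-free.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

JS_WORD_BOXES = """
const boxes = [];
const walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT);
while (walker.nextNode()) {
  const node = walker.currentNode;
  const re = /\\S+/g;  // split each text node into whitespace-free words
  let m;
  while ((m = re.exec(node.textContent)) !== null) {
    // Measure the word's rendered extent with a DOM Range.
    const range = document.createRange();
    range.setStart(node, m.index);
    range.setEnd(node, m.index + m[0].length);
    const r = range.getBoundingClientRect();
    if (r.width > 0 && r.height > 0) {
      boxes.push({word: m[0], x: r.x, y: r.y, w: r.width, h: r.height});
    }
  }
}
return boxes;
"""

def render_and_annotate(url):
    """Return (screenshot_png_bytes, word_boxes) for a rendered page."""
    opts = Options()
    opts.add_argument("--headless")
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(url)  # e.g. a file:// URL pointing into an HTML dump
        # Drop non-informative elements before measuring; this selector
        # list is a placeholder, not the paper's actual filter set.
        driver.execute_script(
            "document.querySelectorAll('.mw-editsection, #mw-navigation')"
            ".forEach(e => e.remove());"
        )
        word_boxes = driver.execute_script(JS_WORD_BOXES)
        return driver.get_screenshot_as_png(), word_boxes
    finally:
        driver.quit()
```

Because the boxes come from the renderer itself rather than from recognition, they stay exact regardless of script or font, which is what lets this approach scale across 270 languages.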
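For the third process, the paper names Pygame for font rendering. A minimal sketch of per-glyph boxes from Pygame's font metrics might look like the following; the baseline-to-rectangle conversion and the simple left-to-right advance are simplifications, since real placement must follow the browser's own layout.

```python
# Minimal sketch: per-glyph bounding boxes via Pygame font metrics.
import pygame

pygame.font.init()

def glyph_boxes(text, font_path, size, origin=(0, 0)):
    """Return one (char, x, y, w, h) box per glyph, advancing left to right."""
    font = pygame.font.Font(font_path, size)  # font_path=None -> default font
    ascent = font.get_ascent()
    x, y = origin
    boxes = []
    for ch, m in zip(text, font.metrics(text)):
        if m is None:  # glyph missing from this font
            continue
        minx, maxx, miny, maxy, advance = m
        # metrics() reports extents relative to the baseline; convert to a
        # top-left (x, y, w, h) rectangle with the baseline at y + ascent.
        boxes.append((ch, x + minx, y + ascent - maxy, maxx - minx, maxy - miny))
        x += advance
    return boxes

print(glyph_boxes("Webvicob", None, 24))
```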
Experimental Evaluation
The paper details extensive experiments demonstrating the efficacy of Webvicob-generated data. BROS, a state-of-the-art VDU backbone, outperforms its counterparts on several VDU benchmarks when pretrained on Webvicob data. Similarly, LayoutXLM, a multilingual backbone, is competitive on low-resource language tasks when pretrained on Webvicob data. Notably, these gains are achieved with fewer training iterations than traditional datasets require, indicating better data efficiency.
Implications and Future Directions
The success of Webvicob in creating diverse, high-quality visual corpora lays the groundwork for future advances in VDU. By removing the dependency on OCR tools, Webvicob offers a scalable and cost-effective solution that is particularly well suited to low-resource languages. Its deployment can substantially democratize access to multilingual datasets, fostering broader research on and application of VDU technologies.
Future research could expand Webvicob's coverage by incorporating broader web repositories, such as CommonCrawl, and integrate stronger augmentation strategies to increase dataset variability. Additionally, pairing Webvicob-generated datasets with OCR-free, transformer-based architectures such as Vision Transformers may unlock new paradigms in document understanding tasks.
In summary, Webvicob addresses a critical gap in VDU by providing a scalable means of generating high-quality, multilingual visual corpora from widely available web resources. The documented improvements in VDU model performance underscore the potential of Webvicob to catalyze advancements in document intelligence technologies, particularly in resource-constrained settings.