
BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks (2412.04626v2)

Published 5 Dec 2024 in cs.LG and cs.CL

Abstract: Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows, extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io .

Summary

  • The paper’s main contribution is the introduction of BigDocs-7.5M, a large-scale dataset for training models on diverse document and code tasks.
  • The paper details a rigorous curation process and benchmark design; models trained on BigDocs improve multimodal document reasoning performance by up to 25.8% over closed-source GPT-4o.
  • The paper demonstrates that open-access datasets democratize advanced document understanding, fostering innovation in both academic and commercial AI research.

BigDocs: An Open-Access Dataset for Multimodal Model Training on Document and Code Tasks

The advent of multimodal AI has significantly advanced document understanding, which is crucial for processing receipts, interpreting workflows, and extracting data from documents. While commercial applications benefit from these advancements, scarce accessible training data and restrictive licensing limit progress in research and public model development. The paper introduces BigDocs-7.5M, an openly licensed, high-quality dataset of 7.5 million multimodal documents spanning 30 distinct tasks, designed to extend the capabilities of open-source multimodal models.
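As a concrete starting point, the sketch below shows how one might load and inspect such a dataset, assuming it is published on the Hugging Face Hub; the repository identifier and column names here are hypothetical placeholders, so consult the project page (https://bigdocs.github.io) for the actual release details.

```python
# Minimal sketch: streaming a large multimodal document dataset from the
# Hugging Face Hub. NOTE: the repository id below is a hypothetical
# placeholder; see https://bigdocs.github.io for the actual location.
from datasets import load_dataset

# Streaming avoids downloading all 7.5M documents up front.
dataset = load_dataset("bigdocs/bigdocs-7.5m", split="train", streaming=True)

for example in dataset.take(3):
    # A multimodal sample would typically pair a document image with
    # text fields (e.g., task instruction, target output, license metadata).
    print(example.keys())
```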

Dataset and Benchmark Design

BigDocs-7.5M addresses the limitations of existing datasets by offering a comprehensive, license-permissive resource engineered to enhance multimodal document understanding across three core dimensions: document information extraction, understanding, and creation. A meticulous curation process ensures high quality by enforcing strict filtering rules, maintaining traceable metadata, and adhering to transparency principles. The dataset is complemented by BigDocs-Bench, a benchmark suite of ten novel tasks that simulate real-world applications such as GUI reasoning and code generation from images.
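To make the curation principles concrete, here is a minimal sketch of a license- and provenance-based filtering step. The record schema and the permissive-license whitelist are hypothetical illustrations, not the paper's actual pipeline, which applies more extensive filtering and content analysis.

```python
# Minimal sketch of a license-permissive curation filter.
# The record fields ("license", "source_url", "text") and the whitelist
# below are hypothetical; the paper's actual rules are more extensive.
PERMISSIVE_LICENSES = {"cc-by-4.0", "cc-by-sa-4.0", "apache-2.0", "mit"}

def keep_record(record: dict) -> bool:
    """Keep only records with a permissive license and traceable provenance."""
    has_permissive_license = record.get("license", "").lower() in PERMISSIVE_LICENSES
    has_traceable_source = bool(record.get("source_url"))  # traceable metadata
    has_content = bool(record.get("text", "").strip())     # drop empty documents
    return has_permissive_license and has_traceable_source and has_content

def curate(records):
    """Filter an iterable of raw records down to the permissive subset."""
    return [r for r in records if keep_record(r)]
```

Keeping the filter a pure predicate over each record makes every inclusion decision auditable, which is in the spirit of the accountability and transparency goals the paper describes.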

Empirical Findings and Evaluation

The paper's experiments demonstrate that models trained on BigDocs surpass closed-source models such as GPT-4o on tasks involving document reasoning and the generation of structured outputs such as Screenshot2HTML and Image2Latex. Empirical evaluation shows an average performance gain of up to 25.8% over this baseline, highlighting the efficacy of BigDocs. Human evaluation further supports these findings, with a marked preference for outputs from models trained on BigDocs, reinforcing the dataset's utility for training robust multimodal AI.
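For clarity on how a headline number like "25.8% over the baseline" is typically computed, the sketch below averages per-task relative improvements against a baseline model. The task names and scores are illustrative placeholders only, not results reported in the paper.

```python
# Minimal sketch: average relative improvement of a model over a baseline
# across benchmark tasks. Scores are illustrative placeholders, NOT
# numbers from the paper.
def avg_relative_improvement(model_scores: dict, baseline_scores: dict) -> float:
    """Mean of per-task (model - baseline) / baseline, as a percentage."""
    gains = [
        (model_scores[task] - baseline_scores[task]) / baseline_scores[task]
        for task in baseline_scores
    ]
    return 100.0 * sum(gains) / len(gains)

baseline = {"Screenshot2HTML": 40.0, "Image2Latex": 50.0}  # hypothetical values
model = {"Screenshot2HTML": 52.0, "Image2Latex": 60.0}     # hypothetical values

print(f"{avg_relative_improvement(model, baseline):.1f}% average improvement")
```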

Potential and Implications

BigDocs represents a significant step towards democratizing access to multimodal document understanding capabilities. By providing a large-scale, permissively licensed dataset, this initiative equips the academic and open-source communities with the tools necessary for developing advanced document understanding technologies. Such resources can catalyze improvements in foundation models, leading to more efficient data extraction, information synthesis, and document creation.

Future Directions

Anticipating further developments, the framework established by BigDocs could pave the way for increasingly sophisticated applications of multimodal AI. Future iterations of BigDocs may focus on expanding the dataset's scope to include more diverse document formats and enhancing the capability of foundation models to interpret complex data structures. Additionally, exploring the integration of these models into commercial applications could yield insights into optimizing document-related processes across industries.

In conclusion, BigDocs-7.5M and BigDocs-Bench represent a pivotal resource fostering open-access research and development in multimodal AI. Their contribution underscores the importance of transparency and accessibility in advancing the field, providing a cornerstone for future innovations in document understanding technologies.
