DOCMASTER: A Unified Platform for Annotation, Training, & Inference in Document Question-Answering (2404.00439v1)

Published 30 Mar 2024 in cs.CL

Abstract: The application of natural language processing models to PDF documents is pivotal for various business applications yet the challenge of training models for this purpose persists in businesses due to specific hurdles. These include the complexity of working with PDF formats that necessitate parsing text and layout information for curating training data and the lack of privacy-preserving annotation tools. This paper introduces DOCMASTER, a unified platform designed for annotating PDF documents, model training, and inference, tailored to document question-answering. The annotation interface enables users to input questions and highlight text spans within the PDF file as answers, saving layout information and text spans accordingly. Furthermore, DOCMASTER supports both state-of-the-art layout-aware and text models for comprehensive training purposes. Importantly, as annotations, training, and inference occur on-device, it also safeguards privacy. The platform has been instrumental in driving several research prototypes concerning document analysis such as the AI assistant utilized by University of California San Diego's (UCSD) International Services and Engagement Office (ISEO) for processing a substantial volume of PDF documents.

Summary

The paper introduces a unified platform that integrates annotation, training, and on-device inference for document QA while preserving data privacy.
The paper leverages robust PDF annotation tools, such as PDF.js and PyMuPDF, to accurately map text for training layout-aware models.
The paper demonstrates significant efficiency gains, including a sevenfold increase in document processing throughput during deployment.

Unified Document-QA Platform with Privacy Preservation and On-Device Processing

Introduction to the Platform

The paper describes a platform designed for annotation, training, and inference in document question-answering tasks. It addresses the complexity of handling PDF documents, keeps all data processing on-device to preserve privacy, and supports both layout-aware and text-based models. The entire workflow, from data annotation through model training to inference, runs on the user's own device, which keeps sensitive documents away from third-party services.

Platform Design and Features

Annotation Interface

The annotation interface lets users upload PDF files, enter questions, and highlight the corresponding answer spans within the document. This interface supports:

Accurate Text Highlighting: Thanks to the integration of PDF.js and PyMuPDF, users experience a robust annotation environment where text selections are precisely mapped to the word-level bounding boxes necessary for training layout-aware models.
Privacy-Focused Data Handling: All data interactions occur on-device with data stored locally, eliminating potential privacy risks associated with third-party data processing.
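
The mapping from a highlighted region to word-level bounding boxes can be illustrated with a minimal sketch. It assumes that each word arrives with its coordinates, in the spirit of the tuples PyMuPDF returns from `page.get_text("words")` (which begin with `x0, y0, x1, y1, word`). The `Word` type, the function names, and the sample coordinates are hypothetical, and the summary does not describe how DOCMASTER itself performs this step.

```python
from typing import List, NamedTuple, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates


class Word(NamedTuple):
    """One word with its bounding box (hypothetical container)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def overlaps(word: Word, sel: Rect) -> bool:
    """Return True when the word box and the selection rectangle intersect."""
    sx0, sy0, sx1, sy1 = sel
    return word.x0 < sx1 and word.x1 > sx0 and word.y0 < sy1 and word.y1 > sy0


def words_in_selection(words: List[Word], sel: Rect) -> List[Word]:
    """Keep the words touched by the user's highlight, in reading order."""
    return [w for w in words if overlaps(w, sel)]


# Made-up words for one page; real values would come from a PDF parser.
page_words = [
    Word(72.0, 100.0, 110.0, 112.0, "Start"),
    Word(114.0, 100.0, 140.0, 112.0, "date:"),
    Word(144.0, 100.0, 200.0, 112.0, "2024-03-30"),
    Word(72.0, 116.0, 120.0, 128.0, "Employer"),
]

# Rectangle the annotator dragged over the answer (also made up).
selection = (143.0, 99.0, 201.0, 113.0)

answer_words = words_in_selection(page_words, selection)
answer_text = " ".join(w.text for w in answer_words)
answer_boxes = [(w.x0, w.y0, w.x1, w.y1) for w in answer_words]
```

Storing both `answer_text` and `answer_boxes` means the same annotation can feed a text-only model or a layout-aware one.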

Training and Model Compatibility

After annotation, users can transition seamlessly to model training:

Flexible Model Training: The platform supports a variety of NLP models including both classic text-based models like RoBERTa and layout-aware models such as LayoutLM.
Collaborative and Incremental Learning: Training data can be collectively used or incrementally added by different users, promoting collaborative improvements and simplifying the model training process.
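
The difference between preparing data for a text-based model and for a layout-aware one can be sketched as follows. LayoutLM-style models expect each word box scaled to a 0 to 1000 integer grid relative to the page size; a text-only model simply omits the boxes. The function names, the dictionary keys, and the page dimensions are illustrative assumptions, not DOCMASTER's actual data format.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page units


def normalize_box(box: Box, page_width: float, page_height: float) -> List[int]:
    """Scale a box to the 0-1000 integer grid used by LayoutLM-style models."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]


def build_example(
    question: str,
    words: Sequence[str],
    boxes: Sequence[Box],
    answer_span: Tuple[int, int],
    page_size: Tuple[float, float],
    layout_aware: bool,
) -> Dict[str, object]:
    """Assemble one training record (hypothetical schema).

    answer_span holds inclusive word indices of the annotated answer.
    """
    example: Dict[str, object] = {
        "question": question,
        "words": list(words),
        "answer_start": answer_span[0],
        "answer_end": answer_span[1],
    }
    if layout_aware:
        width, height = page_size
        example["bbox"] = [normalize_box(b, width, height) for b in boxes]
    return example


words = ["Start", "date:", "2024-03-30"]
boxes = [
    (72.0, 100.0, 110.0, 112.0),
    (114.0, 100.0, 140.0, 112.0),
    (144.0, 100.0, 200.0, 112.0),
]
page_size = (612.0, 792.0)  # assumed US Letter page, in points

layout_example = build_example(
    "What is the start date?", words, boxes, (2, 2), page_size, layout_aware=True
)
text_example = build_example(
    "What is the start date?", words, boxes, (2, 2), page_size, layout_aware=False
)
```

Because both records share the same fields apart from `bbox`, one set of annotations can serve either model family.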

Inference Capabilities

The inference module extends the utility of the platform by allowing:

Efficient QA: Users submit documents and questions to the trained model, receiving answers highlighted directly in the PDF document, which enhances the user's understanding and interaction with the extracted information.
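
How an extractive QA model's output turns into a highlight can be sketched with the standard span-selection step: the model scores every position as a possible answer start and end, and the highest-scoring valid span is returned. The sketch below simplifies by assuming one score per word (real models score subword tokens), and all logits, names, and coordinates are made up for illustration rather than taken from DOCMASTER.

```python
from typing import List, Sequence, Tuple


def best_span(
    start_logits: Sequence[float],
    end_logits: Sequence[float],
    max_answer_len: int = 8,
) -> Tuple[int, int]:
    """Return inclusive (start, end) indices maximizing start + end score.

    Only spans with end >= start and at most max_answer_len words are considered.
    """
    best = (0, 0)
    best_score = float("-inf")
    for s, s_logit in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


words: List[str] = ["Start", "date:", "2024-03-30", "Employer"]
boxes = [
    (72.0, 100.0, 110.0, 112.0),
    (114.0, 100.0, 140.0, 112.0),
    (144.0, 100.0, 200.0, 112.0),
    (72.0, 116.0, 120.0, 128.0),
]

# Hypothetical model outputs, one start and one end score per word.
start_logits = [0.1, 0.2, 3.5, 0.3]
end_logits = [0.0, 0.1, 3.0, 0.4]

start, end = best_span(start_logits, end_logits)
answer = " ".join(words[start : end + 1])
highlight_boxes = boxes[start : end + 1]  # rectangles to draw on the PDF page
```

Keeping word boxes aligned with the model's input is what allows the predicted span to be drawn back onto the page as a highlight.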

Practical Deployment and Results

The platform's deployment at the UCSD International Services and Engagement Office (ISEO) illustrates its practical benefits, particularly in automating the verification process for student work permits. The deployment produced a sevenfold increase in the number of documents processed per hour.

The implementation demonstrated:

High Accuracy and Efficiency: Both the RoBERTa-base and LayoutLM-base models performed well. Correctness scores and bounding-box accuracy reflected the models' practical utility in real-world use better than traditional exact-match accuracy did.
Enhanced Throughput: The model responded quickly and handled data-intensive operations efficiently, given the allocated computing resources.
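
Why exact match can understate usefulness is easiest to see with a small example. This summary does not define the paper's correctness or bounding-box metrics, so the sketch below uses one plausible stand-in, intersection over union (IoU) between predicted and annotated boxes, purely as an assumption. The strings and coordinates are invented.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def exact_match(prediction: str, gold: str) -> bool:
    """Strict string comparison after trimming whitespace and lowercasing."""
    return prediction.strip().lower() == gold.strip().lower()


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned rectangles."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The prediction includes one extra leading word compared with the annotation.
gold_text, gold_box = "2024-03-30", (144.0, 100.0, 200.0, 112.0)
pred_text, pred_box = "date: 2024-03-30", (114.0, 100.0, 200.0, 112.0)

em = exact_match(pred_text, gold_text)
iou = box_iou(pred_box, gold_box)
```

Here the prediction fails exact match even though its highlight covers the whole annotated answer, which is the kind of case where a region-based or human-judged score gives a fairer picture of practical utility.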

Speculations on Future Developments

Looking forward, extending this platform could substantially improve in-house document processing in sectors with stringent data privacy requirements, such as the legal and healthcare domains. Further enhancements could include:

Advanced Model Tuning: Tailoring models to specific types of documents or integrating more advanced NLP capabilities could improve both accuracy and processing speed.
Expanded Use Cases: Beyond QA, the platform could be adapted for tasks like document summarization or entity extraction, broadening its applicability.
Increased Automation: Integration with other enterprise systems like HR databases or customer relationship management platforms could automate broader workflows.

Conclusion

The platform provides a comprehensive, secure, and efficient way to handle document question-answering, with annotation, training, and inference all performed in-house. This has substantial implications for organizations that handle sensitive or proprietary information, since it lets them adopt document AI without relaxing their privacy requirements.