Noise-Aware Training of Layout-Aware Language Models (2404.00488v1)

Published 30 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: A visually rich document (VRD) utilizes visual features along with linguistic cues to disseminate information. Training a custom extractor that identifies named entities from a document requires a large number of instances of the target document type annotated at textual and visual modalities. This is an expensive bottleneck in enterprise scenarios, where we want to train custom extractors for thousands of different document types in a scalable way. Pre-training an extractor model on unlabeled instances of the target document type, followed by a fine-tuning step on human-labeled instances, does not work in these scenarios, as it surpasses the maximum allowable training time allocated for the extractor. We address this scenario by proposing a Noise-Aware Training method, or NAT, in this paper. Instead of acquiring expensive human-labeled documents, NAT utilizes weakly labeled documents to train an extractor in a scalable way. To avoid degradation in the model's quality due to noisy, weakly labeled samples, NAT estimates the confidence of each training sample and incorporates it as an uncertainty measure during training. We train multiple state-of-the-art extractor models using NAT. Experiments on a number of publicly available and in-house datasets show that NAT-trained models are not only robust in performance, outperforming a transfer-learning baseline by up to 6% in terms of macro-F1 score, but also more label-efficient, reducing the amount of human effort required to obtain comparable performance by up to 73%.

Summary

  • The paper introduces NAT, reducing the need for human-labeled data by up to 73% while boosting extraction performance.
  • It fine-tunes pre-trained layout-aware models using a noise-aware loss to effectively combine weakly-labeled, human-labeled, and synthetic data.
  • Experiments on public datasets such as CORD and FUNSD show up to a 6% macro-F1 improvement over a transfer-learning baseline, highlighting NAT's scalability for enterprise VRD processing.

Noise-Aware Training of Layout-Aware Language Models: A Synopsis

This paper addresses the challenge of extracting named entities from visually rich documents (VRDs) in a scalable manner. Traditional methods for training extractors on such documents rely heavily on supervision in both the textual and visual modalities, an approach that often becomes a costly bottleneck in enterprise applications. The solution proposed by the authors, Noise-Aware Training (NAT), circumvents the need for large volumes of human-labeled data by employing a semi-supervised learning framework that incorporates weakly labeled data and noise-aware optimization.

The authors first describe the complexity of VRDs, such as invoices and forms, which require both textual and visual context to be parsed effectively. NAT exploits unlabeled data by assigning weak labels with machine-learning models rather than relying on manually annotated datasets. This allows the model to be trained with a limited amount of human-labeled data, supplemented by weakly labeled samples, without degrading the extractor's performance.
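As a concrete illustration of weak labeling, the labels and confidences might come from an existing extractor (for instance, one trained on a related document type) run over unlabeled documents. The paper does not spell this step out at code level; the PyTorch sketch below is a minimal, hypothetical version in which `base_extractor` and its output shape are assumptions.

```python
import torch

def weak_label(base_extractor, encoded_doc):
    """Produce weak labels and per-token confidence for an unlabeled
    document from an existing extractor's predictions.

    `base_extractor` is hypothetical: any callable returning per-token
    logits of shape (num_tokens, num_labels).
    """
    with torch.no_grad():
        logits = base_extractor(encoded_doc)    # (T, L)
        probs = torch.softmax(logits, dim=-1)   # predictive distribution
        confidence, labels = probs.max(dim=-1)  # weak label = argmax class
    return labels, confidence                   # both of shape (T,)
```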

Methodological Approach

The NAT framework unfolds in three main phases:

  1. Pre-training: The model is initialized from a pre-trained extractor, either LayoutLMv2 or FormNet, both state-of-the-art models for handling spatial and textual information jointly. This step ensures that the extractor inherits rich contextual embeddings from a domain-agnostic pre-trained model.
  2. Noise-Aware Fine-Tuning: The model is fine-tuned on a mixture of weakly labeled data, derived from unlabeled document instances, and a smaller set of human-labeled documents. This process is governed by a novel noise-aware loss function that adjusts the training signal according to the confidence of each weakly labeled sample: uncertain or noisy labels are down-weighted to limit their negative impact on the model's refinement (a minimal sketch of such a loss follows this list).
  3. Synthetic Data Augmentation: Finally, synthetic documents are generated through data augmentation techniques such as synonym substitution and format transformation, and are used to further fine-tune the model for robustness against variability in document presentation (see the augmentation sketch below).
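The paper's exact loss formulation is not reproduced in this synopsis. One common way to realize a noise-aware objective is confidence-weighted cross-entropy, where each weakly labeled token's loss is scaled by the estimated confidence in its label. The sketch below illustrates that idea under stated assumptions; the threshold and weighting scheme are illustrative, not the authors' specification.

```python
import torch
import torch.nn.functional as F

def noise_aware_loss(logits, labels, confidence, threshold=0.5):
    """Confidence-weighted cross-entropy over a document's tokens.

    logits:     (T, L) per-token label scores from the extractor
    labels:     (T,)   weak (possibly noisy) label ids
    confidence: (T,)   estimated confidence in each weak label
    threshold:  hypothetical cutoff below which a sample is ignored
    """
    per_token = F.cross_entropy(logits, labels, reduction="none")  # (T,)
    # Scale each token's loss by its label confidence; drop tokens
    # whose labels are too uncertain to trust at all.
    weights = torch.where(confidence >= threshold,
                          confidence,
                          torch.zeros_like(confidence))
    # Human-labeled tokens would simply carry confidence = 1.0.
    return (weights * per_token).sum() / weights.sum().clamp(min=1e-8)
```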
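For the augmentation phase, the snippet below sketches a toy synonym-substitution pass. The synonym table and replacement probability are invented for illustration; a real pipeline would pair this with the format transformations the paper mentions.

```python
import random

# Hypothetical synonym table; a real pipeline would use a curated
# lexicon or embedding-based nearest neighbours.
SYNONYMS = {
    "invoice": ["bill", "statement"],
    "total": ["amount due", "balance"],
    "date": ["issued on"],
}

def synonym_substitute(tokens, prob=0.3, seed=0):
    """Replace known tokens with a random synonym to create a
    synthetic variant of a document's text."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        options = SYNONYMS.get(tok.lower())
        if options and rng.random() < prob:
            out.append(rng.choice(options))
        else:
            out.append(tok)
    return out

# e.g. ["Invoice", "total", ":", "42.00"] may become
#      ["Invoice", "amount due", ":", "42.00"]
```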

Experimental Validation

The efficacy of NAT is demonstrated through extensive experiments across multiple datasets, including public benchmarks such as CORD and FUNSD as well as proprietary datasets of utility bills and French invoices. Key results indicate that NAT-trained models outperform transfer-learning baselines by up to 6% in macro-F1 score while reducing the reliance on human-labeled data by up to 73%. These results were consistent across document types, underscoring the versatility of the NAT mechanism.
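For readers unfamiliar with the headline metric: macro-F1 averages per-class F1 scores with equal weight, so rare entity types count as much as frequent ones. It can be computed directly with scikit-learn; the labels below are purely illustrative.

```python
from sklearn.metrics import f1_score

# Toy per-token gold and predicted entity tags (illustrative only).
y_true = ["TOTAL", "DATE", "O", "TOTAL", "O", "DATE"]
y_pred = ["TOTAL", "O",    "O", "TOTAL", "O", "DATE"]

# average="macro" gives the unweighted mean of per-class F1 scores.
print(f1_score(y_true, y_pred, average="macro"))
```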

Implications and Future Directions

The findings from this research have significant implications for automating information extraction from VRDs across diverse enterprise applications. The reduction in labeling effort, combined with improved extraction performance, positions NAT as a promising approach for the scalable deployment of VRD extraction systems. The technique also opens pathways for exploring more complex multi-modal contexts, such as video or augmented reality, where interpreting heterogeneous data remains a challenge.

Moreover, future work could investigate enhancements to the weak supervision signals themselves, potentially incorporating self-supervised learning objectives and adversarial training to bolster NAT's robustness against varying noise levels. Additionally, expanding the framework to support real-time document processing could further embed NAT's utility in dynamic business environments.

In summary, this paper makes a substantial contribution to the domain of semi-supervised learning for VRD information extraction, with clear practical applications and a solid foundation for advancing the theoretical underpinnings of noise-aware model training.