The paper introduces a method for multi-page Document Visual Question Answering (Document VQA) built around a self-attention scoring mechanism that improves both performance and efficiency. The approach builds upon the Pix2Struct model and introduces an efficient training strategy tailored to multi-page scenarios.
The paper addresses the limitations of existing Document VQA methods, particularly their struggle with multi-page documents, where processing concatenated pages demands substantial computational resources. The method uses a visual-only document representation based on the Pix2Struct encoder. The self-attention scoring mechanism produces a relevance score for each document page, enabling retrieval of the pertinent pages. This extends single-page Document VQA models to multi-page scenarios without any constraint on the number of pages at evaluation time, and with minimal demand for GPU resources.
Key components and contributions of the work include:
- A self-attention scoring mechanism that adapts single-page Document VQA to multi-page scenarios while minimizing GPU resource usage.
- An efficient training scheme incorporating one positive and one randomly selected negative page per document to train the scoring mechanism.
- A perspective on aligning textual and visual modalities by converting them into pixel-based representations.
- An evaluation that extends the MP-DocVQA dataset to documents of up to 793 pages, demonstrating robustness well beyond the dataset's original page limit.
The method involves a two-stage training scheme. In the first stage, a Document VQA model is trained on single-page scenarios using positive question-page pairs. In the second stage, the contextual features extracted from the frozen first-stage encoder are used to produce a matching score between 0 and 1, indicating the relevance of the question to the page.
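To make the two stages concrete, here is a minimal PyTorch sketch of the stage-2 setup; the encoder below is a stand-in for the fine-tuned Pix2Struct encoder, and all dimensions are illustrative assumptions rather than the authors' configuration:

```python
import torch
import torch.nn as nn

# Stand-in for the Pix2Struct encoder fine-tuned in stage 1 on
# single-page question-page pairs.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,
)

# Stage 2: freeze the stage-1 encoder so that only the scoring module
# (defined in the next sketch) receives gradient updates.
for p in encoder.parameters():
    p.requires_grad = False
encoder.eval()

# Contextual features for one rendered (question, page) input:
# shape (batch, tokens, d_model).
patches = torch.randn(1, 512, 768)
features = encoder(patches)
```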
The self-attention scoring module consists of a self-attention layer followed by three linear layers with a dropout layer in between. The input is the encoded question-page feature. The first vector of the attended sequence is selected as the matching representation, and a matching score between 0 and 1 is produced by passing it through the dropout layer and the three linear layers.
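A possible PyTorch rendering of this module is sketched below; the hidden sizes, head count, activation, and dropout rate are assumptions, since the text fixes only the overall structure (self-attention, first-vector selection, dropout, three linear layers):

```python
import torch
import torch.nn as nn

class SelfAttentionScoring(nn.Module):
    """Scores the relevance of an encoded (question, page) pair in [0, 1]."""

    def __init__(self, d_model: int = 768, n_heads: int = 8, p_drop: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.dropout = nn.Dropout(p_drop)
        # Three linear layers; the intermediate widths are guesses.
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, tokens, d_model) from the frozen encoder.
        attended, _ = self.attn(features, features, features)
        pooled = attended[:, 0]  # first vector as the matching representation
        logits = self.mlp(self.dropout(pooled))
        return torch.sigmoid(logits).squeeze(-1)  # matching score in [0, 1]

scorer = SelfAttentionScoring()
print(scorer(torch.randn(2, 512, 768)).shape)  # torch.Size([2])
```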
During training, pages containing the evidence for an answer are labeled positive, while all others are negative. Only one negative page is randomly selected per question-document input to balance the training data. The encoded feature is fed into the self-attention scoring module, which outputs a matching score interpreted as a probability between 0 and 1. Training uses the Mean Squared Error (MSE) loss with label smoothing.
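Continuing the sketch above, one training step on a positive page and one randomly drawn negative page could look as follows; the smoothing strength `eps` is an assumed value, not taken from the paper:

```python
import torch
import torch.nn.functional as F

# Smoothed regression targets: positive page -> 1 - eps, negative -> eps.
eps = 0.1
targets = torch.tensor([1.0 - eps, eps])

# Frozen-encoder features for the positive and the sampled negative page.
features = torch.randn(2, 512, 768)
scores = scorer(features)           # SelfAttentionScoring from the sketch above
loss = F.mse_loss(scores, targets)  # MSE with label-smoothed targets
loss.backward()
```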
At test time, each page of the document is processed together with the question as input to the encoder, yielding a contextual feature for every question-page pair. The scoring module generates a matching score for each pair, and a Top-1 filtering module selects the pair with the highest score. The full encoder-decoder pipeline is then applied to that pair to generate the answer.
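A minimal sketch of this inference loop follows; `encode_page` and `generate_answer` are hypothetical helpers standing in for the Pix2Struct encoder and decoder calls:

```python
import torch

@torch.no_grad()
def answer(question, pages, encode_page, scorer, generate_answer):
    # Score every page of the document against the question.
    feats = [encode_page(question, page) for page in pages]  # (1, N, d) each
    scores = torch.cat([scorer(f) for f in feats])           # one score per page
    best = int(scores.argmax())                              # Top-1 filtering
    # Run the full encoder-decoder pipeline only on the selected page.
    return generate_answer(feats[best]), best
```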
The paper details experiments on the MP-DocVQA dataset, which contains 46,000 questions over 6,000 documents, each limited to 20 pages. To evaluate the method's robustness, the page-length restriction was removed, allowing documents of up to 793 pages in the test set. The evaluation metrics were Average Normalized Levenshtein Similarity (ANLS) for answers and accuracy (%) for page prediction.
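For reference, the per-question ANLS can be computed as in this sketch, which follows the metric's standard definition with threshold 0.5; the lowercasing and whitespace handling are common conventions, assumed here rather than quoted from the paper:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def anls(prediction: str, answers: list[str], tau: float = 0.5) -> float:
    # Best similarity over the ground-truth answers, zeroed when the
    # normalized distance exceeds the threshold tau.
    best = 0.0
    for ans in answers:
        p, a = prediction.lower().strip(), ans.lower().strip()
        nl = levenshtein(p, a) / max(len(p), len(a), 1)
        if nl < tau:
            best = max(best, 1.0 - nl)
    return best

print(anls("793 pages", ["793 Pages"]))  # 1.0
```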
Hyper-parameter searches were conducted to optimize the number of self-attention layers and attention heads, validated with page prediction accuracy and ANLS. Ablation studies compared aggregation methods within the self-attention scoring module: a special [CLS] token, adaptive average pooling, and selecting the first vector.
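The three aggregation variants can be illustrated side by side (a minimal sketch; shapes are arbitrary, and the [CLS] vector would be learnable in practice):

```python
import torch
import torch.nn as nn

tokens = torch.randn(2, 512, 768)  # encoder output for a question-page pair

# (a) Special [CLS] token: prepend an extra vector, run self-attention,
# then read out its position (only the prepend/read-out steps shown).
cls = torch.zeros(1, 1, 768).expand(2, -1, -1)
with_cls = torch.cat([cls, tokens], dim=1)   # (2, 513, 768)
cls_repr = with_cls[:, 0]

# (b) Adaptive average pooling over the token dimension.
avg_repr = nn.AdaptiveAvgPool1d(1)(tokens.transpose(1, 2)).squeeze(-1)

# (c) Selecting the first vector directly (the variant the paper adopts).
first_repr = tokens[:, 0]
```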
The results show that the proposed method outperforms state-of-the-art methods in page prediction, with an accuracy of 81.55%, and achieves a competitive ANLS score of 0.6199. It matches the performance of Hi-VT5 without requiring Optical Character Recognition (OCR) annotations, while using fewer parameters. The method also handles lengthy documents without any page limitation.
Further experiments evaluated performance in the unrestricted scenario, restoring each document to its complete original page count. While page prediction accuracy decreased by approximately 25%, ANLS declined by less than 13%. Analysis of the validation data revealed cases where an incorrect page prediction still yielded the correct answer, attributable to supporting evidence appearing on multiple pages.