The paper introduces a method for multi-page Document Visual Question Answering (Document VQA) built around a self-attention scoring mechanism that improves both performance and efficiency. The approach builds upon the Pix2Struct model and introduces an efficient training strategy tailored to multi-page scenarios.
The paper addresses the limitations of existing Document VQA methods, particularly their struggle with multi-page documents, where processing concatenated pages demands substantial computational resources. The method uses a visual-only document representation based on the Pix2Struct encoder. The self-attention scoring mechanism produces a relevance score for each document page, enabling retrieval of the pertinent pages. This extends single-page Document VQA models to multi-page scenarios without any constraint on the number of pages at evaluation time, and with minimal demand for GPU resources.
Key components and contributions of the work include:
- A self-attention scoring mechanism that adapts single-page Document VQA to multi-page scenarios while minimizing GPU resource usage.
- An efficient training scheme incorporating one positive and one randomly selected negative page per document to train the scoring mechanism.
- A perspective on aligning textual and visual modalities by converting them into pixel-based representations.
- An evaluation that extends the MP-DocVQA dataset to documents of up to 793 pages, demonstrating robustness well beyond the dataset's original page limit.
The method involves a two-stage training scheme. In the first stage, a Document VQA model is trained on single-page scenarios using positive question-page pairs. In the second stage, the contextual features extracted from the frozen first-stage encoder are used to produce a matching score between 0 and 1, indicating the relevance of the question to the page.
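To make the two stages concrete, here is a minimal PyTorch sketch of the stage-2 setup; the encoder below is a stand-in for the fine-tuned Pix2Struct encoder, and all dimensions are illustrative assumptions rather than the authors' configuration:

```python
import torch
import torch.nn as nn

# Stand-in for the Pix2Struct encoder fine-tuned in stage 1 on
# single-page question-page pairs.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,
)

# Stage 2: freeze the stage-1 encoder so that only the scoring module
# (defined in the next sketch) receives gradient updates.
for p in encoder.parameters():
    p.requires_grad = False
encoder.eval()

# Contextual features for one rendered (question, page) input:
# shape (batch, tokens, d_model).
patches = torch.randn(1, 512, 768)
features = encoder(patches)
```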
The self-attention scoring module consists of a self-attention layer followed by three linear layers with a dropout layer in between. The input is the encoded question-page feature. The first vector of the attended sequence is selected as the matching representation, and a matching score between 0 and 1 is produced by passing it through the dropout layer and the three linear layers.
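A possible PyTorch rendering of this module is sketched below; the hidden sizes, head count, activation, and dropout rate are assumptions, since the text fixes only the overall structure (self-attention, first-vector selection, dropout, three linear layers):

```python
import torch
import torch.nn as nn

class SelfAttentionScoring(nn.Module):
    """Scores the relevance of an encoded (question, page) pair in [0, 1]."""

    def __init__(self, d_model: int = 768, n_heads: int = 8, p_drop: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.dropout = nn.Dropout(p_drop)
        # Three linear layers; the intermediate widths are guesses.
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, tokens, d_model) from the frozen encoder.
        attended, _ = self.attn(features, features, features)
        pooled = attended[:, 0]  # first vector as the matching representation
        logits = self.mlp(self.dropout(pooled))
        return torch.sigmoid(logits).squeeze(-1)  # matching score in [0, 1]

scorer = SelfAttentionScoring()
print(scorer(torch.randn(2, 512, 768)).shape)  # torch.Size([2])
```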
During training, pages containing the evidence for an answer are labeled positive, while all others are negative. Only one negative page is randomly selected per question-document input to balance the training data. The encoded feature is fed into the self-attention scoring module, which outputs a matching score interpreted as a probability between 0 and 1. Training uses the Mean Squared Error (MSE) loss with label smoothing.
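Continuing the sketch above, one training step on a positive page and one randomly drawn negative page could look as follows; the smoothing strength `eps` is an assumed value, not taken from the paper:

```python
import torch
import torch.nn.functional as F

# Smoothed regression targets: positive page -> 1 - eps, negative -> eps.
eps = 0.1
targets = torch.tensor([1.0 - eps, eps])

# Frozen-encoder features for the positive and the sampled negative page.
features = torch.randn(2, 512, 768)
scores = scorer(features)           # SelfAttentionScoring from the sketch above
loss = F.mse_loss(scores, targets)  # MSE with label-smoothed targets
loss.backward()
```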
At test time, each page of the document is processed together with the question as input to the encoder, yielding a contextual feature for every question-page pair. The scoring module generates a matching score for each pair, and a Top-1 filtering module selects the pair with the highest score. The full encoder-decoder pipeline is then applied to that pair to generate the answer.
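A minimal sketch of this inference loop follows; `encode_page` and `generate_answer` are hypothetical helpers standing in for the Pix2Struct encoder and decoder calls:

```python
import torch

@torch.no_grad()
def answer(question, pages, encode_page, scorer, generate_answer):
    # Score every page of the document against the question.
    feats = [encode_page(question, page) for page in pages]  # (1, N, d) each
    scores = torch.cat([scorer(f) for f in feats])           # one score per page
    best = int(scores.argmax())                              # Top-1 filtering
    # Run the full encoder-decoder pipeline only on the selected page.
    return generate_answer(feats[best]), best
```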
The paper details experiments on the MP-DocVQA dataset, which contains 46,000 questions over 6,000 documents, each limited to 20 pages. To evaluate the method's robustness, the page-length restriction was removed, allowing documents of up to 793 pages in the test set. The evaluation metrics were Average Normalized Levenshtein Similarity (ANLS) for answers and accuracy (%) for page prediction.
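For reference, the per-question ANLS can be computed as in this sketch, which follows the metric's standard definition with threshold 0.5; the lowercasing and whitespace handling are common conventions, assumed here rather than quoted from the paper:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def anls(prediction: str, answers: list[str], tau: float = 0.5) -> float:
    # Best similarity over the ground-truth answers, zeroed when the
    # normalized distance exceeds the threshold tau.
    best = 0.0
    for ans in answers:
        p, a = prediction.lower().strip(), ans.lower().strip()
        nl = levenshtein(p, a) / max(len(p), len(a), 1)
        if nl < tau:
            best = max(best, 1.0 - nl)
    return best

print(anls("793 pages", ["793 Pages"]))  # 1.0
```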
Hyper-parameter searches were conducted to optimize the number of self-attention layers and attention heads, validated with page prediction accuracy and ANLS. Ablation studies compared aggregation methods within the self-attention scoring module: a special [CLS] token, adaptive average pooling, and selecting the first vector.
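The three aggregation variants can be illustrated side by side (a minimal sketch; shapes are arbitrary, and the [CLS] vector would be learnable in practice):

```python
import torch
import torch.nn as nn

tokens = torch.randn(2, 512, 768)  # encoder output for a question-page pair

# (a) Special [CLS] token: prepend an extra vector, run self-attention,
# then read out its position (only the prepend/read-out steps shown).
cls = torch.zeros(1, 1, 768).expand(2, -1, -1)
with_cls = torch.cat([cls, tokens], dim=1)   # (2, 513, 768)
cls_repr = with_cls[:, 0]

# (b) Adaptive average pooling over the token dimension.
avg_repr = nn.AdaptiveAvgPool1d(1)(tokens.transpose(1, 2)).squeeze(-1)

# (c) Selecting the first vector directly (the variant the paper adopts).
first_repr = tokens[:, 0]
```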
The results show that the proposed method outperforms state-of-the-art methods in page prediction, with an accuracy of 81.55%, and achieves a competitive ANLS score of 0.6199. It matches the performance of Hi-VT5 without requiring Optical Character Recognition (OCR) annotations, while using fewer parameters. The method also handles lengthy documents without any page limitation.
Further experiments evaluated performance in the unrestricted scenario, restoring each document to its complete original page count. While page prediction accuracy decreased by approximately 25%, ANLS declined by less than 13%. Analysis of the validation data revealed cases where an incorrect page prediction still yielded the correct answer, attributable to supporting evidence appearing on multiple pages.