PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering (2404.12720v1)
Abstract: Document Question Answering (QA) poses challenges in understanding visually-rich documents (VRDs), particularly those dominated by lengthy textual content such as research journal articles. Existing studies primarily focus on real-world documents with sparse text, and challenges persist in comprehending the hierarchical semantic relations across multiple pages in order to locate multimodal components. To address this gap, we propose PDF-MVQA, a dataset tailored for research journal articles that encompasses multi-page, multimodal information retrieval. Unlike traditional machine reading comprehension (MRC) tasks, our approach aims to retrieve entire paragraphs containing answers, or visually rich document entities such as tables and figures. Our contributions include the introduction of a comprehensive PDF Document VQA dataset that enables examination of the semantically hierarchical layout structures of text-dominant documents. We also present new VRD-QA frameworks designed to grasp textual content and the relations among document layouts simultaneously, extending page-level understanding to the entire multi-page document. Through this work, we aim to enhance the capabilities of existing vision-and-language models in handling the challenges posed by text-dominant documents in VRD-QA.
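The retrieval-style formulation described above (ranking whole document entities such as paragraphs, tables, and figures across pages, rather than extracting an answer span) can be illustrated with a minimal sketch. This is not the paper's model: it uses a simple bag-of-words cosine similarity as a stand-in scorer, and the `doc` structure with `page`/`type`/`text` fields is a hypothetical representation of a multi-page document.

```python
import math
import re
from collections import Counter

def bow_vector(text):
    """Lowercased bag-of-words term counts (toy stand-in for a learned encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, entities, top_k=1):
    """Rank candidate document entities from all pages and return the
    top_k most similar to the question -- the whole entity is the answer,
    not a text span inside it."""
    q = bow_vector(question)
    ranked = sorted(entities,
                    key=lambda e: cosine(q, bow_vector(e["text"])),
                    reverse=True)
    return ranked[:top_k]

# Hypothetical multi-page document: each candidate entity keeps its page and type.
doc = [
    {"page": 1, "type": "paragraph",
     "text": "We describe the model architecture in detail."},
    {"page": 2, "type": "table",
     "text": "Table 2 reports accuracy on the test split."},
    {"page": 3, "type": "paragraph",
     "text": "The third section discusses related work on document analysis."},
]
print(retrieve("What accuracy is reported on the test split?", doc, top_k=1))
```

In the actual task, the scorer would be a vision-and-language model that also sees layout and visual features, but the output contract is the same: a ranked list of multimodal entities drawn from the entire document.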
- Yihao Ding
- Kaixuan Ren
- Jiabin Huang
- Siwen Luo
- Soyeon Caren Han