GRAM: Global Reasoning for Multi-Page VQA (2401.03411v2)
Abstract: The increasing use of transformer-based LLMs brings forward the challenge of processing long sequences. In document visual question answering (DocVQA), leading methods focus on the single-page setting, while documents can span hundreds of pages. We present GRAM, a method that seamlessly extends pre-trained single-page models to the multi-page setting without requiring computationally heavy pretraining. To do so, we leverage a single-page encoder for local, page-level understanding and enhance it with designated document-level layers and learnable tokens, facilitating the flow of information across pages for global reasoning. To ensure that the model makes use of the newly introduced document tokens, we propose a tailored bias adaptation method. For additional computational savings during decoding, we introduce an optional compression stage using our compression transformer (C-Former), which reduces the encoded sequence length and thereby allows a tradeoff between quality and latency. Extensive experiments showcase GRAM's state-of-the-art performance on multi-page DocVQA benchmarks, demonstrating the effectiveness of our approach.
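The interleaved local/global encoding the abstract describes lends itself to a compact sketch. Below is a minimal, hypothetical PyTorch rendering: a page-level layer (standing in for the pre-trained single-page encoder) processes each page independently, while a newly added document-level layer exchanges information across pages only through a small set of learnable document tokens. All names here (`GramEncoderBlock`, `n_doc_tokens`, etc.) are illustrative rather than the authors' implementation, a stock `nn.TransformerEncoderLayer` stands in for the paper's actual encoder layers, and the bias adaptation and C-Former compression stages are omitted.

```python
# Minimal sketch of GRAM-style interleaved local/global encoding (assumed
# names; not the authors' code). A stock nn.TransformerEncoderLayer stands
# in for the pre-trained single-page encoder layer described in the paper.
import torch
import torch.nn as nn


class GramEncoderBlock(nn.Module):
    """One block: a page-level layer encodes each page independently, then a
    document-level layer mixes information across pages, but only through a
    small set of learnable document tokens prepended to every page."""

    def __init__(self, d_model: int, n_heads: int, n_doc_tokens: int):
        super().__init__()
        # Local layer: per-page self-attention (the pre-trained encoder).
        self.page_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Global layer: newly added, sees only doc tokens from all pages.
        self.doc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.n_doc_tokens = n_doc_tokens

    def forward(self, pages: torch.Tensor) -> torch.Tensor:
        # pages: (num_pages, n_doc_tokens + page_len, d_model); positions
        # [0, n_doc_tokens) of each page hold its document tokens.
        num_pages, _, d_model = pages.shape
        k = self.n_doc_tokens

        # 1) Local step: pages form the batch dimension, so each page
        #    attends only within itself -- cost is linear in page count.
        pages = self.page_layer(pages)

        # 2) Global step: concatenate the doc tokens of all pages into one
        #    short sequence and let them attend to each other.
        doc = pages[:, :k, :].reshape(1, num_pages * k, d_model)
        doc = self.doc_layer(doc).reshape(num_pages, k, d_model)

        # 3) Write the updated doc tokens back so the next block's local
        #    step can read cross-page context.
        return torch.cat([doc, pages[:, k:, :]], dim=1)


# Toy usage: 10 pages, 4 doc tokens + 256 page tokens each, width 512.
block = GramEncoderBlock(d_model=512, n_heads=8, n_doc_tokens=4)
x = torch.randn(10, 4 + 256, 512)
y = block(x)  # same shape; doc tokens now carry cross-page context
assert y.shape == x.shape
```

The point of restricting the global step to `num_pages * n_doc_tokens` positions is that cross-page attention stays over a short sequence, rather than growing quadratically with the total token count of the document.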
Authors:
- Tsachi Blau
- Sharon Fogel
- Roi Ronen
- Alona Golts
- Roy Ganz
- Elad Ben Avraham
- Aviad Aberdam
- Shahar Tsiper
- Ron Litman