
GRAM: Global Reasoning for Multi-Page VQA (2401.03411v2)

Published 7 Jan 2024 in cs.CL and cs.CV

Abstract: The increasing use of transformer-based LLMs brings forward the challenge of processing long sequences. In document visual question answering (DocVQA), leading methods focus on the single-page setting, while documents can span hundreds of pages. We present GRAM, a method that seamlessly extends pre-trained single-page models to the multi-page setting without requiring computationally heavy pretraining. To do so, we leverage a single-page encoder for local page-level understanding and enhance it with designated document-level layers and learnable tokens, facilitating the flow of information across pages for global reasoning. To encourage the model to utilize the newly introduced document tokens, we propose a tailored bias adaptation method. For additional computational savings during decoding, we introduce an optional compression stage using our compression transformer (C-Former), reducing the encoded sequence length and thereby allowing a trade-off between quality and latency. Extensive experiments showcase GRAM's state-of-the-art performance on the benchmarks for multi-page DocVQA, demonstrating the effectiveness of our approach.
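The interleaving the abstract describes — local page-level encoding alternated with document-level layers that only see a small set of learnable document tokens — can be sketched in toy form. This is a minimal illustrative sketch, not the authors' implementation: the "attention" here is a simple mean, and names such as `doc_tokens` and `global_layer` are hypothetical stand-ins for real self-attention layers.

```python
# Toy sketch of GRAM-style global reasoning across pages.
# Each page is encoded locally together with its copy of a small set of
# learnable document tokens; a document-level layer then lets only those
# tokens exchange information across pages, so the cross-page cost is
# independent of per-page sequence length.

def local_encode(page_feats, doc_tokens):
    # Stand-in for a single-page encoder layer: a page attends only to
    # its own features plus its document tokens ("attention" = mean).
    ctx = sum(page_feats + doc_tokens) / (len(page_feats) + len(doc_tokens))
    return [f + ctx for f in page_feats], [t + ctx for t in doc_tokens]

def global_layer(all_doc_tokens):
    # Document-level layer: document tokens from every page attend to
    # each other, enabling information flow across pages.
    flat = [t for page in all_doc_tokens for t in page]
    ctx = sum(flat) / len(flat)
    return [[t + ctx for t in page] for page in all_doc_tokens]

def gram_encode(pages, n_doc_tokens=2, n_layers=2):
    # One set of (notionally learnable) document tokens, copied per page.
    doc = [[0.0] * n_doc_tokens for _ in pages]
    for _ in range(n_layers):
        pages_out, doc_out = [], []
        for feats, toks in zip(pages, doc):
            f, t = local_encode(feats, toks)
            pages_out.append(f)
            doc_out.append(t)
        pages, doc = pages_out, global_layer(doc_out)
    return pages, doc

# Three toy "pages" of scalar features.
pages = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]
enc_pages, enc_doc = gram_encode(pages)
```

After a global layer, every page's document tokens carry a summary influenced by every other page, which is the mechanism that lifts a single-page encoder to multi-page reasoning; the paper's bias adaptation and C-Former compression stage are not modeled here.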

Authors (9)
  1. Tsachi Blau (5 papers)
  2. Sharon Fogel (9 papers)
  3. Roi Ronen (6 papers)
  4. Alona Golts (7 papers)
  5. Roy Ganz (19 papers)
  6. Elad Ben Avraham (4 papers)
  7. Aviad Aberdam (16 papers)
  8. Shahar Tsiper (9 papers)
  9. Ron Litman (15 papers)
Citations (8)