GRAM: Global Reasoning for Multi-Page VQA (2401.03411v2)
Abstract: The increasing use of transformer-based LLMs brings forward the challenge of processing long sequences. In document visual question answering (DocVQA), leading methods focus on the single-page setting, while documents can span hundreds of pages. We present GRAM, a method that seamlessly extends pre-trained single-page models to the multi-page setting without requiring computationally heavy pretraining. To do so, we leverage a single-page encoder for local, page-level understanding and enhance it with designated document-level layers and learnable tokens, facilitating the flow of information across pages for global reasoning. To ensure that the model makes use of the newly introduced document tokens, we propose a tailored bias adaptation method. For additional computational savings during decoding, we introduce an optional compression stage using our compression transformer (C-Former), which reduces the encoded sequence length and thereby allows a tradeoff between quality and latency. Extensive experiments showcase GRAM's state-of-the-art performance on multi-page DocVQA benchmarks, demonstrating the effectiveness of our approach.
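The interleaved local/global encoding the abstract describes lends itself to a compact sketch. Below is a minimal, hypothetical PyTorch rendering: a page-level layer (standing in for the pre-trained single-page encoder) processes each page independently, while a newly added document-level layer exchanges information across pages only through a small set of learnable document tokens. All names here (`GramEncoderBlock`, `n_doc_tokens`, etc.) are illustrative rather than the authors' implementation, a stock `nn.TransformerEncoderLayer` stands in for the paper's actual encoder layers, and the bias adaptation and C-Former compression stages are omitted.

```python
# Minimal sketch of GRAM-style interleaved local/global encoding (assumed
# names; not the authors' code). A stock nn.TransformerEncoderLayer stands
# in for the pre-trained single-page encoder layer described in the paper.
import torch
import torch.nn as nn


class GramEncoderBlock(nn.Module):
    """One block: a page-level layer encodes each page independently, then a
    document-level layer mixes information across pages, but only through a
    small set of learnable document tokens prepended to every page."""

    def __init__(self, d_model: int, n_heads: int, n_doc_tokens: int):
        super().__init__()
        # Local layer: per-page self-attention (the pre-trained encoder).
        self.page_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Global layer: newly added, sees only doc tokens from all pages.
        self.doc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.n_doc_tokens = n_doc_tokens

    def forward(self, pages: torch.Tensor) -> torch.Tensor:
        # pages: (num_pages, n_doc_tokens + page_len, d_model); positions
        # [0, n_doc_tokens) of each page hold its document tokens.
        num_pages, _, d_model = pages.shape
        k = self.n_doc_tokens

        # 1) Local step: pages form the batch dimension, so each page
        #    attends only within itself -- cost is linear in page count.
        pages = self.page_layer(pages)

        # 2) Global step: concatenate the doc tokens of all pages into one
        #    short sequence and let them attend to each other.
        doc = pages[:, :k, :].reshape(1, num_pages * k, d_model)
        doc = self.doc_layer(doc).reshape(num_pages, k, d_model)

        # 3) Write the updated doc tokens back so the next block's local
        #    step can read cross-page context.
        return torch.cat([doc, pages[:, k:, :]], dim=1)


# Toy usage: 10 pages, 4 doc tokens + 256 page tokens each, width 512.
block = GramEncoderBlock(d_model=512, n_heads=8, n_doc_tokens=4)
x = torch.randn(10, 4 + 256, 512)
y = block(x)  # same shape; doc tokens now carry cross-page context
assert y.shape == x.shape
```

The point of restricting the global step to `num_pages * n_doc_tokens` positions is that cross-page attention stays over a short sequence, rather than growing quadratically with the total token count of the document.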
Authors:
- Tsachi Blau
- Sharon Fogel
- Roi Ronen
- Alona Golts
- Roy Ganz
- Elad Ben Avraham
- Aviad Aberdam
- Shahar Tsiper
- Ron Litman