
LVLM Abilities for Long-Context Document Understanding

Establish the capabilities of Large Vision-Language Models (LVLMs) for long-context document understanding, i.e., determine whether these models can reliably comprehend and answer questions over lengthy, multi-page documents.


Background

Large Vision-Language Models (LVLMs) have achieved strong results on single-page document understanding benchmarks, yet real-world documents often span tens of pages and combine diverse modalities with complex layouts. This introduces challenges beyond single-page tasks, such as locating evidence across a long context and reasoning over multiple pages.

The paper introduces MMLongBench-Doc to evaluate this gap and reports that current LVLMs struggle significantly with long-context document understanding, underscoring the need to rigorously determine and characterize their capabilities in this setting.

References

However, their abilities on long-context DU remain an open problem.

MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations (Ma et al., arXiv:2407.01523, 1 Jul 2024), Abstract, page 1