
Focus Anywhere for Fine-grained Multi-page Document Understanding (2405.14295v1)

Published 23 May 2024 in cs.CV

Abstract: Modern LVLMs still struggle to achieve fine-grained document understanding, such as OCR/translation/caption for regions of interest to the user, tasks that require the context of the entire page, or even multiple pages. Accordingly, this paper proposes Fox, an effective pipeline, hybrid data, and tuning strategy that catalyzes LVLMs to focus anywhere on single/multi-page documents. We introduce a novel task to boost document understanding by making LVLMs focus attention on document-level regions, such as redefining full-page OCR as foreground focus. We employ multiple vision vocabularies to extract visual hybrid knowledge for interleaved document pages (e.g., a page containing a photo). Meanwhile, we render cross-vocabulary vision data as the catalyzer to achieve a full reaction of multiple visual vocabularies and in-document figure understanding. Further, without modifying the weights of the multiple vision vocabularies, the above catalyzed fine-grained understanding capabilities can be efficiently tuned to multi-page documents, enabling the model to focus anywhere in both format-free and page-free manners. In addition, we build a benchmark including 9 fine-grained sub-tasks (e.g., region-level OCR/summary, color-guided OCR) to promote document analysis in the community. The experimental results verify the superiority of our model.

Authors (10)
  1. Chenglong Liu (11 papers)
  2. Haoran Wei (55 papers)
  3. Jinyue Chen (5 papers)
  4. Lingyu Kong (13 papers)
  5. Zheng Ge (60 papers)
  6. Zining Zhu (41 papers)
  7. Liang Zhao (353 papers)
  8. Jianjian Sun (23 papers)
  9. Chunrui Han (21 papers)
  10. Xiangyu Zhang (328 papers)
Citations (10)