TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models (2404.09204v1)

Published 14 Apr 2024 in cs.CV and cs.AI

Abstract: Multimodal LLMs (MLLMs) have shown impressive results on various multimodal tasks. However, most existing MLLMs are not well suited for document-oriented tasks, which require fine-grained image perception and information compression. In this paper, we present TextHawk, an MLLM specifically designed for document-oriented tasks while preserving the general capabilities of MLLMs. TextHawk aims to achieve efficient fine-grained perception through four dedicated components. First, a ReSampling and ReArrangement (ReSA) module is proposed to reduce the redundancy in document texts and lower the computational cost of the MLLM. We encode the position of each local feature with Scalable Positional Embeddings (SPEs), which preserve scalability across varying image sizes. A Query Proposal Network (QPN) is then adopted to initialize the queries dynamically among different sub-images. To further enhance the fine-grained visual perception of the MLLM, we design a Multi-Level Cross-Attention (MLCA) mechanism that captures the hierarchical structure and semantic relations of document images. Furthermore, we create a new instruction-tuning dataset for document-oriented tasks by enriching multimodal document data with Gemini Pro. We conduct extensive experiments on both general and document-oriented MLLM benchmarks and show that TextHawk outperforms state-of-the-art methods, demonstrating its effectiveness and superiority in fine-grained document perception and general abilities.

TextHawk: Advancements in Multimodal LLMs for Document-Oriented Tasks

Introduction

The field of Multimodal LLMs (MLLMs) has advanced significantly with models capable of understanding and generating information across modalities, notably vision and text. Among their applications, document-oriented tasks stand out for their complexity: high-resolution images densely packed with information. The challenge lies in achieving fine-grained visual perception while efficiently compressing document-image information. TextHawk is a specialized MLLM that addresses these challenges while maintaining robust general capabilities across vision and language domains.

Document-Oriented MLLMs and Their Limitations

Existing MLLMs have attempted to improve fine-grained visual perception and information compression through methods such as increasing input resolution and adding vision-language adapters. However, these approaches often fail to strike a balance between general and document-specific capabilities, leaving a gap for further exploration.

TextHawk: Core Components and Innovations

TextHawk introduces four pivotal components designed to address the nuanced demands of document-oriented tasks:

  • ReSampling and ReArrangement (ReSA): A module that significantly compresses visual information, reducing the number of visual tokens required for document images and thus lowering computational cost (a minimal sketch of the idea follows this list).
  • Scalable Positional Embeddings (SPEs): Designed to encode the positions of sub-images efficiently, SPEs remain scalable across varying image sizes.
  • Query Proposal Network (QPN): This component dynamically initializes queries among different sub-images, addressing the variability inherent in document images.
  • Multi-Level Cross-Attention (MLCA): Enhances fine-grained visual perception by leveraging the hierarchical structure and semantic relations within document images.
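
To make the compression idea concrete, below is a minimal PyTorch-style sketch of a ReSA-like module: a small set of learnable queries cross-attends over the dense visual tokens (resampling), and groups of neighbouring output tokens are then folded into the channel dimension (rearrangement). The dimensions, query count, and merge factor are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ReSASketch(nn.Module):
    """Illustrative ReSampling-and-ReArrangement-style token compressor.

    Not the authors' implementation: it only sketches (1) resampling dense
    visual tokens with a small set of learnable queries via cross-attention
    and (2) folding groups of adjacent tokens into the channel dimension to
    shrink the sequence further. All sizes are assumptions.
    """

    def __init__(self, dim=1024, num_queries=256, num_heads=8, merge=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.merge = merge                        # tokens folded into one
        self.proj = nn.Linear(dim * merge, dim)   # back to the LLM width

    def forward(self, visual_tokens):             # (B, N, dim) from the ViT
        b = visual_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # ReSampling: learnable queries cross-attend over the dense tokens.
        resampled, _ = self.attn(q, visual_tokens, visual_tokens)
        # ReArrangement: fold `merge` neighbouring tokens into the channels.
        b, n, d = resampled.shape
        rearranged = resampled.reshape(b, n // self.merge, d * self.merge)
        return self.proj(rearranged)               # (B, num_queries/merge, dim)
```

With the assumed defaults, any number of input patch tokens is first reduced to 256 resampled tokens and then to 64 tokens handed to the LLM; TextHawk's actual token budget is a design choice of the paper and may differ, but this is the compression pattern the component list describes.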

Additionally, TextHawk is trained on a new instruction-tuning dataset tailored for document-oriented tasks, created by enriching multimodal document data with Gemini Pro; this complements the architecture's focus on fine-grained perception and information compression.

Empirical Validation

TextHawk has been evaluated on both general and document-oriented MLLM benchmarks, where it outperforms state-of-the-art methods, substantiating its effectiveness in fine-grained document perception while maintaining general vision-language capabilities.

Ablation Studies and Insights

A series of ablation studies shed light on the contributions of TextHawk’s individual components:

  • The combination of ReSA's components leads to significant reductions in visual tokens, enabling more efficient processing of high-resolution document images.
  • SPEs and QPN collectively contribute to the model’s enhanced perception capabilities, accommodating the diversity and complexity of document-oriented tasks.
  • MLCA's ability to leverage multi-level features results in improved fine-grained perception, an essential attribute for document image understanding (a rough sketch of the idea follows this list).
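
As a rough illustration of the multi-level idea, the sketch below lets a set of queries cross-attend to feature maps from several encoder stages in turn, with residual accumulation. It is a hedged approximation of the MLCA concept using generic multi-head cross-attention; it does not reproduce the paper's exact level selection or gating.

```python
import torch
import torch.nn as nn

class MLCASketch(nn.Module):
    """Toy multi-level cross-attention: queries attend to several encoder
    feature levels (deep/semantic through shallow/fine-grained) and sum the
    updates residually. A sketch of the idea, not TextHawk's exact scheme."""

    def __init__(self, dim=1024, num_heads=8, num_levels=3):
        super().__init__()
        self.level_attn = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_levels)
        ])

    def forward(self, queries, level_features):
        # queries: (B, Q, dim); level_features: list of (B, N_l, dim) tensors,
        # one per encoder stage.
        out = queries
        for attn, feats in zip(self.level_attn, level_features):
            upd, _ = attn(out, feats, feats)  # attend to this feature level
            out = out + upd                   # residual accumulation
        return out
```

Shallow levels carry stroke- and character-scale detail while deeper levels carry layout-scale semantics, which is why combining them helps fine-grained document perception.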

Limitations and Future Directions

While TextHawk marks a notable advancement, the freezing of the visual encoder during training points to potential areas for further exploration. Future work could involve adaptively training the vision encoder on task-specific data to refine and expand its perception capabilities.
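
For readers who want to experiment with that direction, the snippet below shows one common way to partially unfreeze a ViT-style vision encoder in PyTorch. It assumes the encoder exposes a `blocks` attribute (typical of ViT implementations) and is only a sketch of the general recipe, not part of TextHawk's released code.

```python
import torch.nn as nn

def set_vision_encoder_trainable(encoder: nn.Module, last_n_blocks: int = 0):
    """Freeze the whole vision encoder, then optionally unfreeze its last
    few transformer blocks for task-specific fine-tuning.

    `encoder.blocks` is an assumed attribute name; adapt to your encoder.
    """
    for p in encoder.parameters():
        p.requires_grad = False        # fully frozen, as in TextHawk's training
    if last_n_blocks > 0:
        blocks = list(encoder.blocks)  # assumed: an ordered sequence of blocks
        for blk in blocks[-last_n_blocks:]:
            for p in blk.parameters():
                p.requires_grad = True  # unfreeze only the deepest blocks
```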

Conclusion

TextHawk represents a significant leap forward in the specialized domain of document-oriented MLLMs. By addressing the intricate challenges of fine-grained visual perception and efficient information compression, TextHawk sets a new benchmark for future developments in the field. Its state-of-the-art performance across a wide range of benchmarks underscores its potential to pave the way for advanced document image understanding applications, bridging the gap between multimodal LLMs and the nuanced requirements of document-oriented tasks.

Authors (6)
  1. Ya-Qi Yu
  2. Minghui Liao
  3. Jihao Wu
  4. Yongxin Liao
  5. Xiaoyu Zheng
  6. Wei Zeng