InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding (2403.01487v1)

Published 3 Mar 2024 in cs.CV

Abstract: Multimodal LLMs (MLLMs) have experienced significant advancements recently. Nevertheless, challenges persist in the accurate recognition and comprehension of intricate details within high-resolution images. Despite being indispensable for the development of robust MLLMs, this area remains underinvestigated. To tackle this challenge, our work introduces InfiMM-HD, a novel architecture specifically designed for processing images of different resolutions with low computational overhead. This innovation facilitates the enlargement of MLLMs to higher-resolution capabilities. InfiMM-HD incorporates a cross-attention module and visual windows to reduce computation costs. By integrating this architectural design with a four-stage training pipeline, our model attains improved visual perception efficiently and cost-effectively. Empirical study underscores the robustness and effectiveness of InfiMM-HD, opening new avenues for exploration in related areas. Codes and models can be found at https://huggingface.co/Infi-MM/infimm-hd

Summary

  • The paper introduces InfiMM-HD, which significantly advances multimodal models by efficiently integrating high-resolution image processing using a novel cross-attention module and visual windows.
  • The paper details a four-stage training pipeline that gradually aligns vision and language features while dynamically adapting to varying image resolutions.
  • Empirical results show InfiMM-HD’s superior performance in tasks such as TextVQA and DocVQA, paving the way for advanced applications in detailed visual analysis.

InfiMM-HD: Enhancing Multimodal LLMs with High-Resolution Image Processing

Introduction to InfiMM-HD

Multimodal LLMs (MLLMs) have made significant strides in integrating visual cues with textual understanding. However, a notable gap persists in their ability to parse and comprehend high-resolution images, which are crucial for applications that require detailed visual insight. Addressing this gap, the paper introduces InfiMM-HD, an MLLM architecture designed to process images across a range of resolutions, with particular emphasis on high-resolution inputs. The model combines a cross-attention module with visual windows to keep computational overhead low even when handling detailed visual data.

Key Contributions and Architectural Innovations

InfiMM-HD's Novel Architecture

  • Cross-Attention Module: At the heart of InfiMM-HD is a cross-attention mechanism that integrates the visual and textual modalities. Unlike prior approaches that rely on MLPs (Multi-Layer Perceptrons) to transform and align visual tokens, this design balances computational efficiency with expressive vision-language fusion.
  • Visual Windows for Computational Efficiency: To counter the rapidly escalating computation costs of processing higher-resolution images, InfiMM-HD partitions each image into sub-images (visual windows) that are processed by a shared Vision Transformer (ViT), making high-resolution inputs tractable (see the sketch after this list).
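
The snippet below is a minimal sketch of these two ideas: splitting a high-resolution image into fixed-size windows that share a single ViT, and injecting the resulting visual tokens into the language stream through gated cross-attention. All module names, dimensions, and the zero-initialized gate are illustrative assumptions rather than the released InfiMM-HD implementation.

```python
# Sketch: window partitioning + gated cross-attention (illustrative, not the official code).
import torch
import torch.nn as nn


def partition_into_windows(image: torch.Tensor, window: int = 448) -> torch.Tensor:
    """Split a (B, C, H, W) image into (B * num_windows, C, window, window) crops.

    Assumes H and W are multiples of `window`; a dynamic-resolution stage would
    resize or pad inputs so this holds.
    """
    b, c, h, w = image.shape
    crops = image.unfold(2, window, window).unfold(3, window, window)  # (B, C, nH, nW, win, win)
    crops = crops.permute(0, 2, 3, 1, 4, 5).reshape(-1, c, window, window)
    return crops


class VisualCrossAttention(nn.Module):
    """Gated cross-attention block: text hidden states attend to visual tokens."""

    def __init__(self, text_dim: int = 4096, vis_dim: int = 1024, n_heads: int = 8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, text_dim)        # align visual width to LLM width
        self.attn = nn.MultiheadAttention(text_dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))            # tanh(0) = 0: start as identity

    def forward(self, text_states: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        v = self.vis_proj(visual_tokens)
        attended, _ = self.attn(query=text_states, key=v, value=v)
        return text_states + torch.tanh(self.gate) * attended  # gated residual injection


if __name__ == "__main__":
    img = torch.randn(1, 3, 896, 896)                       # high-resolution input
    windows = partition_into_windows(img)                    # 4 windows of 448x448 share one ViT
    vis_tokens = torch.randn(1, windows.shape[0] * 256, 1024)  # stand-in for per-window ViT tokens
    txt = torch.randn(1, 32, 4096)                            # stand-in for LLM hidden states
    out = VisualCrossAttention()(txt, vis_tokens)
    print(out.shape)  # torch.Size([1, 32, 4096])
```

Initializing the gate at zero is a common Flamingo-style choice: it leaves the pretrained LLM's outputs untouched at the start of training and lets visual information flow in gradually as the gate is learned.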

Four-Stage Training Pipeline

A distinguishing feature of InfiMM-HD is its four-stage training pipeline, designed to gradually build the model's proficiency in handling high-resolution images (an illustrative configuration sketch follows the list):

  1. Pretraining with Image Resolution Upscaling: The initial stage aligns vision and language features at standard resolution before gradually moving to higher resolutions.
  2. Knowledge Injection and Alignment with Cross-Attention Module Training: The next stage trains the cross-attention mechanism, refining the model's capability to integrate detailed visual information.
  3. Dynamic Resolution Adaptation for High-Resolution Handling: A key innovation here is the ability to adaptively process a range of resolutions, significantly reducing training and computational costs.
  4. Visual Instruction Fine-Tuning: The final stage sharpens the model's ability to follow visual instructions precisely, enhancing its applicability across various tasks.
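
To make the staged schedule concrete, the sketch below expresses it as data. The stage names, resolution values, and data descriptions are assumptions for illustration only; the exact settings are those reported in the paper.

```python
# Illustrative configuration for the four training stages described above.
from typing import Any, Dict, List

TRAINING_STAGES: List[Dict[str, Any]] = [
    {
        "name": "pretraining",
        "image_resolution": 224,  # assumed standard resolution
        "focus": "align vision and language features",
    },
    {
        "name": "knowledge_injection_and_alignment",
        "image_resolution": 448,  # assumed upscaled resolution
        "focus": "train the cross-attention module on richer visual detail",
    },
    {
        "name": "dynamic_resolution_adaptation",
        "image_resolution": "variable (window-partitioned high-res)",
        "focus": "adapt to a range of input resolutions at low cost",
    },
    {
        "name": "visual_instruction_finetuning",
        "image_resolution": "variable (window-partitioned high-res)",
        "focus": "follow visual instructions across downstream tasks",
    },
]

if __name__ == "__main__":
    for i, stage in enumerate(TRAINING_STAGES, start=1):
        print(f"Stage {i}: {stage['name']} @ {stage['image_resolution']} -> {stage['focus']}")
```

Expressing the schedule as data keeps the training driver simple: each stage swaps the input resolution and objective without changing the surrounding training loop.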

Empirical Evaluation and Implications

InfiMM-HD exhibits superior performance across multiple benchmarks, showcasing its ability to process high-resolution images without compromising efficiency or accuracy. The model's structure and training methodology present a promising avenue for future research in enhancing MLLMs for detailed visual perception.

  • The extensive empirical study underscores the robustness of InfiMM-HD, particularly its adeptness at fine-grained visual perception, as demonstrated by superior results on downstream tasks such as TextVQA and DocVQA.

Theoretical and Practical Implications

The introduction of InfiMM-HD not only addresses a crucial gap in MLLM capabilities but also sets a new direction for future research in the field. On a theoretical level, it proposes an effective architecture and training scheme for integrating high-resolution images in MLLMs, expanding our understanding of multimodal learning.

Practically, the model's enhanced visual perception capabilities open new possibilities in applications requiring detailed image analysis, from medical imaging to surveillance and beyond. Additionally, InfiMM-HD's efficient computation model makes high-resolution image processing more accessible, potentially broadening the scope for real-world applications of MLLMs.

Concluding Remarks

InfiMM-HD represents a significant leap forward in the field of MLLMs, combining high-resolution image processing capabilities with computational efficiency. Its innovative architecture, coupled with a strategic training pipeline, offers a practical solution to the challenges of integrating detailed visual data into multimodal learning models. As such, InfiMM-HD not only advances the field of MLLMs but also lays the groundwork for future explorations into more sophisticated and efficient multimodal learning systems.