InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding (2403.01487v1)
Abstract: Multimodal large language models (MLLMs) have advanced significantly in recent years. Nevertheless, accurately recognizing and comprehending intricate details within high-resolution images remains a challenge. Although indispensable for the development of robust MLLMs, this area remains underinvestigated. To tackle this challenge, our work introduces InfiMM-HD, a novel architecture specifically designed to process images of varying resolutions with low computational overhead, enabling MLLMs to scale to higher-resolution inputs. InfiMM-HD incorporates a cross-attention module and visual windows to reduce computation costs. Combined with a four-stage training pipeline, this architectural design lets the model attain improved visual perception efficiently and cost-effectively. Empirical studies underscore the robustness and effectiveness of InfiMM-HD, opening new avenues for exploration in related areas. Code and models can be found at https://huggingface.co/Infi-MM/infimm-hd
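For intuition on how such a design can be organized, the following is a minimal, hypothetical sketch (not the released implementation; see the linked repository for the authors' code) of a gated cross-attention block in which LLM hidden states attend to visual tokens produced by encoding fixed-size windows of a high-resolution image. All module and parameter names here (`VisualCrossAttention`, `d_visual`, the window/token counts) are illustrative assumptions, not names from the paper.

```python
# Illustrative sketch only: text hidden states cross-attend to visual tokens
# gathered from per-window encodings of a high-resolution image.
import torch
import torch.nn as nn

class VisualCrossAttention(nn.Module):
    def __init__(self, d_model: int = 4096, d_visual: int = 1024, num_heads: int = 32):
        super().__init__()
        # Keys/values come from the vision encoder, queries from the LLM.
        self.attn = nn.MultiheadAttention(
            d_model, num_heads, kdim=d_visual, vdim=d_visual, batch_first=True
        )
        # Zero-initialized tanh gate on the residual branch (assumed design choice).
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # text_hidden:   (batch, text_len, d_model) hidden states from the LLM
        # visual_tokens: (batch, n_windows * tokens_per_window, d_visual), obtained by
        #   encoding each image window (e.g., a fixed-size crop) separately and
        #   concatenating the resulting token sequences.
        attended, _ = self.attn(query=text_hidden, key=visual_tokens, value=visual_tokens)
        return text_hidden + torch.tanh(self.gate) * attended  # gated residual

if __name__ == "__main__":
    # Toy shapes for a quick sanity check.
    block = VisualCrossAttention(d_model=512, d_visual=256, num_heads=8)
    text = torch.randn(2, 16, 512)      # 2 sequences of 16 text tokens
    vis = torch.randn(2, 4 * 64, 256)   # 4 windows x 64 visual tokens each
    print(block(text, vis).shape)       # torch.Size([2, 16, 512])
```

Starting the gate at zero leaves the pretrained LLM's outputs unchanged at the beginning of training, a common stabilization trick in cross-attention-based MLLMs such as Flamingo; whether InfiMM-HD uses exactly this gating is not established by the abstract alone.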