InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD (2404.06512v1)

Published 9 Apr 2024 in cs.CV and cs.CL

Abstract: The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its progression has been hindered by challenges in comprehending fine-grained visual content due to limited resolution. Recent efforts have aimed to enhance the high-resolution understanding capabilities of LVLMs, yet they remain capped at approximately 1500 x 1500 pixels and constrained to a relatively narrow resolution range. This paper presents InternLM-XComposer2-4KHD, a groundbreaking exploration into elevating LVLM resolution capabilities up to 4K HD (3840 x 1600) and beyond. Concurrently, considering that ultra-high resolution may not be necessary in all scenarios, it supports a wide range of diverse resolutions from 336 pixels to 4K standard, significantly broadening its scope of applicability. Specifically, this research advances the patch division paradigm by introducing a novel extension: dynamic resolution with automatic patch configuration. It maintains the training image aspect ratios while automatically varying patch counts and configuring layouts based on a pre-trained Vision Transformer (ViT) (336 x 336), leading to dynamic training resolution from 336 pixels to 4K standard. Our research demonstrates that scaling training resolution up to 4K HD leads to consistent performance enhancements without hitting the ceiling of potential improvements. InternLM-XComposer2-4KHD shows superb capability that matches or even surpasses GPT-4V and Gemini Pro in 10 of the 16 benchmarks. The InternLM-XComposer2-4KHD model series with 7B parameters is publicly available at https://github.com/InternLM/InternLM-XComposer.

Exploring the Capabilities of InternLM-XComposer2-4KHD in High-Resolution Vision-Language Modeling

Overview of InternLM-XComposer2-4KHD

InternLM-XComposer2-4KHD represents a significant step forward in the domain of Large Vision-Language Models (LVLMs), aiming to tackle one of the outstanding challenges in the field: the processing and understanding of high-resolution visual content. By extending the capabilities of LVLMs to handle resolutions up to 4K HD (3840 × 1600) and supporting a broad spectrum of resolutions starting from 336 pixels, the paper presents a novel approach to dynamic resolution with automatic patch configuration. This technique preserves the aspect ratio of input images while automatically adjusting patch counts and layouts according to the resolution of the input image.
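
To make the patch-division idea concrete, the sketch below (Python/PIL, assuming the 336 × 336 ViT input size described in the paper) picks a patch grid that approximates the input aspect ratio under a patch budget and slices the resized image into local crops plus a coarse global view. The function names, the grid-search heuristic, and the `max_patches` budget are illustrative assumptions, not the authors' released implementation; in the paper's scheme the patch budget itself also scales with the input resolution, which this sketch leaves out for brevity.

```python
from PIL import Image

PATCH = 336  # base ViT input size described in the paper

def dynamic_patch_layout(width: int, height: int, max_patches: int = 25) -> tuple[int, int]:
    """Pick a (cols, rows) grid whose aspect ratio is closest to the image's,
    subject to cols * rows staying within a patch budget.
    Simplified grid search for illustration, not the authors' exact rule."""
    best, best_err = (1, 1), float("inf")
    for cols in range(1, max_patches + 1):
        rows = max_patches // cols                     # largest row count within the budget
        err = abs(cols / rows - width / height)
        if err < best_err:
            best, best_err = (cols, rows), err
    return best

def split_into_patches(img: Image.Image, max_patches: int = 25):
    """Resize the image onto the chosen grid and cut it into 336x336 crops;
    a low-resolution global view is kept alongside the local crops."""
    cols, rows = dynamic_patch_layout(*img.size, max_patches)
    resized = img.resize((cols * PATCH, rows * PATCH))
    crops = [
        resized.crop((c * PATCH, r * PATCH, (c + 1) * PATCH, (r + 1) * PATCH))
        for r in range(rows) for c in range(cols)
    ]
    global_view = img.resize((PATCH, PATCH))           # coarse overview of the whole image
    return global_view, crops, (cols, rows)
```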

Key Contributions and Methodology

The paper outlines several notable contributions and methodological advancements:

  1. Dynamic Resolution and Automatic Patch Configuration: The model automatically varies the number of image patches and their layout according to the resolution and aspect ratio of the input, which lets it process images ranging from 336 pixels up to 4K HD without a fixed input size.
  2. Training and Performance Improvement with High Resolution: The paper demonstrates that scaling LVLM training to support high-resolution images leads to consistent performance improvements across multiple benchmarks, without reaching a performance saturation point. This suggests potential for future research into even higher resolution processing capabilities.
  3. Evaluation on Diverse Benchmarks: InternLM-XComposer2-4KHD is evaluated across 16 benchmarks, matching or surpassing GPT-4V and Gemini Pro on 10 of the 16 and achieving state-of-the-art results on six of them. Its performance on HD-OCR datasets is particularly noteworthy, where it significantly outperforms other models.
  4. Addressing Image 2D Structure Recognition: A learnable newline token is inserted after each row of patch tokens so the model can recover the 2D structure of the image (a minimal sketch follows this list). This is particularly important for accurately processing documents, charts, tables, and infographics, whose meaning depends on spatial arrangement.
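
As a rough illustration of the newline-token idea, the PyTorch sketch below appends a learnable embedding after each row of patch features before they are handed to the language model. The module name, shapes, and the assumption that patch features arrive in row-major order are illustrative choices, not the released code.

```python
import torch
import torch.nn as nn

class RowNewline(nn.Module):
    """Append a learnable 'newline' embedding after each row of patch features
    so the language model can recover the image's 2D layout.
    Minimal sketch with illustrative names, not the released implementation."""
    def __init__(self, dim: int):
        super().__init__()
        self.newline = nn.Parameter(torch.zeros(dim))

    def forward(self, patch_feats: torch.Tensor, rows: int, cols: int) -> torch.Tensor:
        # patch_feats: (batch, rows * cols, dim), assumed to be in row-major order
        b, n, d = patch_feats.shape
        assert n == rows * cols, "token count must match the patch grid"
        grid = patch_feats.reshape(b, rows, cols, d)
        nl = self.newline.view(1, 1, 1, d).expand(b, rows, 1, d)
        with_newlines = torch.cat([grid, nl], dim=2)        # (b, rows, cols + 1, d)
        return with_newlines.reshape(b, rows * (cols + 1), d)

# usage (shapes only): tokens = RowNewline(dim=4096)(vit_features, rows=3, cols=4)
```

Without such a separator, the rows of patch tokens are concatenated into one flat sequence and the model receives no explicit signal about where one image row ends and the next begins.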

Implications and Future Directions

The research carries both practical and theoretical implications for the field of AI and machine learning:

  • Practical Applicability in Real-World Scenarios: By significantly expanding the resolution capabilities, InternLM-XComposer2-4KHD supports a wider range of practical applications where fine-grained visual content understanding is crucial, including document analysis, content creation, and multimedia processing.
  • Promising Direction for Future Research: The consistent performance improvement observed with increasing training resolutions indicates a promising direction for future research in LVLMs, particularly in exploring the upper limits of resolution enhancements and their impact on model performance.
  • Reconsidering Patch Processing Techniques: The paper suggests that there is merit in revisiting and improving patch processing techniques for high-resolution image understanding. The dynamic resolution and automatic patch configuration approach proposed could inspire new methodologies in handling diverse input resolutions and aspect ratios efficiently.

Conclusion

InternLM-XComposer2-4KHD sets a new precedent in the LVLM domain by addressing the challenging aspect of high-resolution visual content processing. Through its novel approach to dynamic resolution handling and the significant performance improvements demonstrated across a variety of benchmarks, this model opens up new avenues for research and practical applications in the field of generative AI and vision-language modeling. Future studies building on this work may further expand the capabilities of LVLMs, potentially leading to even more sophisticated and versatile models capable of handling an even broader range of visual content with greater accuracy and efficiency.

Authors (24)
  1. Xiaoyi Dong
  2. Pan Zhang
  3. Yuhang Zang
  4. Yuhang Cao
  5. Bin Wang
  6. Linke Ouyang
  7. Songyang Zhang
  8. Haodong Duan
  9. Wenwei Zhang
  10. Yining Li
  11. Hang Yan
  12. Yang Gao
  13. Zhe Chen
  14. Xinyue Zhang
  15. Wei Li
  16. Jingwen Li
  17. Wenhai Wang
  18. Kai Chen
  19. Conghui He
  20. Xingcheng Zhang