LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images (2403.11703v1)

Published 18 Mar 2024 in cs.CV and cs.AI

Abstract: Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding the visual world. Conventional LMMs process images in fixed sizes and limited resolutions, while recent explorations in this direction are limited in adaptivity, efficiency, and even correctness. In this work, we first take GPT-4V and LLaVA-1.5 as representative examples and expose systematic flaws rooted in their visual encoding strategy. To address the challenges, we present LLaVA-UHD, a large multimodal model that can efficiently perceive images in any aspect ratio and high resolution. LLaVA-UHD includes three key components: (1) An image modularization strategy that divides native-resolution images into smaller variable-sized slices for efficient and extensible encoding, (2) a compression module that further condenses image tokens from visual encoders, and (3) a spatial schema to organize slice tokens for LLMs. Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2-3 orders of magnitude more data on 9 benchmarks. Notably, our model built on LLaVA-1.5 336x336 supports 6 times larger (i.e., 672x1088) resolution images using only 94% inference computation, and achieves 6.4 accuracy improvement on TextVQA. Moreover, the model can be efficiently trained in academic settings, within 23 hours on 8 A100 GPUs (vs. 26 hours of LLaVA-1.5). We make the data and code publicly available at https://github.com/thunlp/LLaVA-UHD.

LLaVA-UHD: Efficiently Handling Any Aspect Ratio and High-Resolution Images in Large Multimodal Models

Introduction

The capabilities of multimodal understanding, reasoning, and interaction have witnessed substantial advancements, largely attributed to the integration of visual signals into LLMs. This integration hinges on efficient and adaptive visual encoding strategies. Current Large Multimodal Models (LMMs), however, fall short in efficiently handling images of varying aspect ratios and high resolutions, which is paramount for real-world applications. This paper introduces LLaVA-UHD, a novel LMM that efficiently processes images of any aspect ratio and high resolution. LLaVA-UHD addresses these shortcomings via an innovative image modularization strategy, a compression module, and a spatial schema for organizing slice tokens.

Systematic Flaws in Existing Models

Investigating GPT-4V and LLaVA-1.5 as representative examples, the paper identifies systematic flaws in their visual encoding strategies, particularly in correctly perceiving high-resolution images. The findings also underscore a potential vulnerability to adversarial attacks, emphasizing the need for improved visual encoding strategies.

Core Components of LLaVA-UHD

  1. Image Modularization Strategy: This component divides native-resolution images into smaller, variable-sized slices, adapting efficiently to any aspect ratio and resolution. Unlike previous methods that rely on fixed aspect ratios, LLaVA-UHD's approach ensures full adaptivity with minimal deviation from the visual encoder's pretraining settings (see the sketch after this list).
  2. Compression Module: To manage the processing demands of high-resolution images, a compression layer further condenses image tokens, reducing the computational load on LLMs.
  3. Spatial Schema: A novel spatial schema organizes slice tokens, providing LLMs with contextual information about slice positions within the image. This aids the model in understanding the global structure of the image from its parts.
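To make the modularization and spatial schema concrete, below is a minimal illustrative sketch in Python. It is not the authors' implementation: the scoring rule (log-ratio deviations in area and aspect ratio from the encoder's 336x336 pretraining shape), the slice budget MAX_SLICES, the separator choices, and the <slice_r_c> placeholder names are assumptions made here for illustration, and the compression module that would condense each slice's patch tokens before they reach the LLM is omitted.

import math

ENCODER_RES = 336   # pretraining resolution of the visual encoder (as in LLaVA-1.5)
MAX_SLICES = 6      # assumed budget on the number of slices per image


def choose_grid(width: int, height: int):
    """Pick a (cols, rows) slice grid whose slices stay close to the
    encoder's pretraining resolution and (square) aspect ratio."""
    best, best_cost = (1, 1), float("inf")
    for cols in range(1, MAX_SLICES + 1):
        for rows in range(1, MAX_SLICES // cols + 1):
            slice_w, slice_h = width / cols, height / rows
            # deviation of the slice area from the pretraining area
            res_dev = abs(math.log(slice_w * slice_h / ENCODER_RES ** 2))
            # deviation of the slice aspect ratio from 1:1
            ar_dev = abs(math.log(slice_w / slice_h))
            cost = res_dev + ar_dev
            if cost < best_cost:
                best, best_cost = (cols, rows), cost
    return best


def spatial_schema(cols: int, rows: int) -> str:
    """Lay out slice placeholders row by row: a comma between slices in a
    row, a newline between rows, so the LLM can recover slice positions."""
    return "\n".join(
        ",".join(f"<slice_{r}_{c}>" for c in range(cols))
        for r in range(rows)
    )


if __name__ == "__main__":
    cols, rows = choose_grid(672, 1088)   # the 672x1088 example from the abstract
    print(f"grid: {cols} x {rows}")
    print(spatial_schema(cols, rows))

For the 672x1088 example from the abstract, this illustrative scoring selects a 2 x 3 grid of roughly 336 x 363 slices, each close to the encoder's 336 x 336 pretraining shape; each <slice_r_c> placeholder would stand for that slice's (compressed) token sequence in the LLM input.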

Experimental Findings

LLaVA-UHD demonstrates superior performance across nine benchmarks, outperforming established LMMs trained with two to three orders of magnitude more data. Noteworthy improvements over LLaVA-1.5 include a 6.4-point accuracy gain on TextVQA and a 3.2-point gain on POPE. Moreover, it supports images at roughly six times higher resolution (672x1088 versus 336x336) while requiring only 94% of LLaVA-1.5's inference computation.
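As a quick sanity check of the resolution figure, using only the numbers quoted above:

\[
\frac{672 \times 1088}{336 \times 336} = \frac{731136}{112896} \approx 6.5
\]

so the 672x1088 setting indeed covers roughly six times the pixel count of the original 336x336 input.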

Practical Implications and Theoretical Significance

LLaVA-UHD's approach contributes to the broader field of AI by offering a way to process high-resolution images within LMMs without sacrificing accuracy or computational efficiency. The model's adaptability to any aspect ratio and resolution reflects a significant step toward handling real-world images more effectively.

Future Directions

The paper hints at future exploration into encoding higher-resolution images and tasks such as small object detection, emphasizing the need for continued advancement in visual encoding strategies within multimodal systems.

Conclusion

LLaVA-UHD represents a critical advancement in the visual perception capabilities of LMMs. By addressing the fundamental limitations around aspect ratio adaptability and the processing of high-resolution images, the model sets a new benchmark for efficiency and accuracy in multimodal AI systems.

Authors (10)
  1. Ruyi Xu
  2. Yuan Yao
  3. Zonghao Guo
  4. Junbo Cui
  5. Zanlin Ni
  6. Chunjiang Ge
  7. Tat-Seng Chua
  8. Zhiyuan Liu
  9. Maosong Sun
  10. Gao Huang
Citations (67)