
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest (2307.03601v3)

Published 7 Jul 2023 in cs.CV

Abstract: Visual instruction tuning of large language models (LLMs) on image-text pairs has achieved general-purpose vision-language abilities. However, the lack of region-text pairs limits their advancement to fine-grained multimodal understanding. In this paper, we propose spatial instruction tuning, which introduces references to regions of interest (RoIs) in the instruction. Before being sent to the LLM, each reference is replaced by RoI features and interleaved with language embeddings as a sequence. Our model, GPT4RoI, trained on 7 region-text pair datasets, brings an unprecedented interactive and conversational experience compared to previous image-level models. (1) Interaction beyond language: users can interact with the model through both language and drawn bounding boxes, flexibly adjusting the referring granularity. (2) Versatile multimodal abilities: GPT4RoI can mine a variety of attribute information within each RoI, e.g., color, shape, material, and action, and can reason about multiple RoIs using common sense. On the Visual Commonsense Reasoning (VCR) dataset, GPT4RoI achieves a remarkable accuracy of 81.6%, surpassing all existing models by a significant margin (the second place is 75.6%) and almost reaching human-level performance of 85.0%. The code, dataset, and demo can be found at https://github.com/jshilong/GPT4RoI.

Insights on "GPT4RoI: Instruction Tuning LLM on Region-of-Interest"

The paper "GPT4RoI: Instruction Tuning LLM on Region-of-Interest" introduces an advanced model for fine-grained multimodal understanding by enhancing LLMs with spatial instruction tuning. This research highlights a novel approach in efficiently aligning region-level visual features with language embeddings, stepping beyond the conventional image-text pair paradigm.

Model Architecture and Methodology

GPT4RoI builds on state-of-the-art components, including a CLIP-based vision encoder and the Vicuna model for language understanding. A key contribution is the extraction of region-level features with RoIAlign over fused multi-level feature maps and their injection into the instruction: designated region tokens in the user prompt are replaced by the corresponding RoI features, yielding a sequence that interleaves spatial and language embeddings before it reaches the LLM (see the sketch below).
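As a rough illustration of this interleaving step, the following PyTorch sketch pools each box with RoIAlign and swaps the result into the placeholder positions of the language embedding sequence. Names such as IMAGE_SIZE, REGION_TOKEN_ID, and the projection layer are assumptions for the example, not the released code:

```python
# Minimal sketch of region-feature extraction and token replacement
# (illustrative only; not the authors' exact implementation).
import torch
from torchvision.ops import roi_align

IMAGE_SIZE = 224          # assumed input resolution
REGION_TOKEN_ID = 32001   # hypothetical id of the <region> placeholder token

def embed_with_regions(token_ids, token_embeds, feat_map, boxes, proj):
    """token_ids: (L,) ids; token_embeds: (L, D) language embeddings;
    feat_map: (1, C, H, W) vision-encoder feature map;
    boxes: (N, 4) RoIs in image coordinates; proj: nn.Linear(C * 7 * 7, D)."""
    scale = feat_map.shape[-1] / IMAGE_SIZE
    rois = roi_align(feat_map, [boxes], output_size=7,
                     spatial_scale=scale, aligned=True)   # (N, C, 7, 7)
    region_embeds = proj(rois.flatten(1))                 # (N, D)
    out = token_embeds.clone()
    out[token_ids == REGION_TOKEN_ID] = region_embeds     # interleave with text
    return out
```

In the actual model, features from several encoder levels are fused before pooling; the single feature map here just keeps the sketch short.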

The architecture supports interaction beyond language-only input: users can specify regions of interest directly by drawing bounding boxes, adjusting the referring granularity of the conversation. This capability is evaluated across several datasets that probe the model's ability to understand and reason about specific regions within images; an example of the resulting prompt format is sketched below.
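For concreteness, a spatial instruction might look as follows; the placeholder syntax is assumed here for illustration and may differ from the released prompts:

```python
# Each drawn box is paired with a <region{i}> placeholder in the text; the
# placeholder embedding is later replaced by that box's RoI feature.
instruction = "What is <region1> doing, and how does it relate to <region2>?"
boxes = [
    [48, 120, 210, 365],   # region1: [x1, y1, x2, y2] in pixel coordinates
    [300, 90, 470, 400],   # region2
]
```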

Training and Dataset Utilization

Training proceeds in two stages: an initial alignment stage on simple region-text pairs, followed by fine-tuning on complex concepts and reasoning tasks. Datasets such as COCO and Visual Genome are converted into a spatial instruction format so that interactions can reference individual regions, and incorporating LLaVA150K further strengthens multi-round dialogue (a conversion sketch follows).
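A rough sketch of such a conversion, under an assumed record schema (the field names and question templates are illustrative, not the paper's exact ones):

```python
# Convert a region-text pair (e.g. a Visual Genome region description)
# into a single-turn spatial instruction sample.
import random

QUESTION_TEMPLATES = [
    "What can you see in <region{idx}>?",
    "Briefly describe <region{idx}>.",
]

def to_spatial_instruction(sample):
    """sample: {'bbox': [x1, y1, x2, y2], 'caption': str} -- assumed schema."""
    question = random.choice(QUESTION_TEMPLATES).format(idx=1)
    return {
        "boxes": [sample["bbox"]],
        "conversations": [
            {"from": "human", "value": question},
            {"from": "gpt", "value": sample["caption"]},
        ],
    }
```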

Training uses the standard next-token prediction loss, through which the region features are aligned with the surrounding linguistic context. This pushes the model beyond mere category recognition toward understanding fine-grained details of each region (see the loss sketch below).
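A minimal sketch of that objective, assuming the common instruction-tuning convention of masking prompt and placeholder positions with -100 so that only response tokens contribute to the loss:

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, labels):
    """logits: (B, L, V) LLM outputs; labels: (B, L) with -100 on masked positions."""
    shift_logits = logits[:, :-1, :].contiguous()   # predict token t+1 from token t
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                           shift_labels.view(-1), ignore_index=-100)
```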

Results and Findings

Empirical results show that GPT4RoI excels at tasks requiring fine-grained understanding. On the Visual Commonsense Reasoning (VCR) benchmark it reaches 81.6% accuracy, well above the previous best of 75.6% and close to the human level of 85.0%. The model also performs strongly on region captioning and region-level reasoning, underscoring its comprehension and reasoning capabilities.

Implications and Future Work

The research has substantial ramifications for developing more interactive and precise AI systems in multimodal understanding. The ability to reference and reason about specific regions opens up possibilities for applications that necessitate detailed visual understanding, paving the way for more intuitive AI interactions.

Future developments could focus on expanding region-level datasets and refining model architectures to further enhance performance. Exploring semi-supervised techniques for generating region-level data and developing diverse interaction modes could enable a more comprehensive understanding of visual content.

Overall, GPT4RoI marks a significant advancement in multimodal AI, demonstrating a seamless integration of spatial and linguistic processing. This work sets the stage for further exploration and refinement of vision-language models.

Authors (9)
  1. Shilong Zhang
  2. Peize Sun
  3. Shoufa Chen
  4. Min Xiao
  5. Wenqi Shao
  6. Wenwei Zhang
  7. Yu Liu
  8. Kai Chen
  9. Ping Luo