Osprey: Pixel Understanding with Visual Instruction Tuning (2312.10032v3)

Published 15 Dec 2023 in cs.CV

Abstract: Multimodal LLMs (MLLMs) have recently achieved impressive general-purpose vision-language capabilities through visual instruction tuning. However, current MLLMs primarily focus on image-level or box-level understanding, falling short in achieving fine-grained vision-language alignment at pixel level. Besides, the lack of mask-based instruction data limits their advancements. In this paper, we propose Osprey, a mask-text instruction tuning approach, to extend MLLMs by incorporating fine-grained mask regions into language instruction, aiming at achieving pixel-wise visual understanding. To achieve this goal, we first meticulously curate a mask-based region-text dataset with 724K samples, and then design a vision-LLM by injecting pixel-level representation into LLM. Specifically, Osprey adopts a convolutional CLIP backbone as the vision encoder and employs a mask-aware visual extractor to extract precise visual mask features from high resolution input. Experimental results demonstrate Osprey's superiority in various region understanding tasks, showcasing its new capability for pixel-level instruction tuning. In particular, Osprey can be integrated with Segment Anything Model (SAM) seamlessly to obtain multi-granularity semantics. The source code, dataset and demo can be found at https://github.com/CircleRadon/Osprey.

Analyzing Osprey: Pixel Understanding with Visual Instruction Tuning

Osprey presents a significant advancement in the field of multimodal LLMs (MLLMs), focusing on pixel-level vision-language alignment. Traditional MLLMs have excelled in image-level understanding but often lacked fine-grained alignment, limiting their efficacy in tasks requiring detailed region-based comprehension. Osprey addresses this gap through mask-text instruction tuning, which integrates fine-grained mask regions into language instruction for achieving pixel-wise visual understanding.

Methodological Innovations

Osprey introduces a Mask-Aware Visual Extractor that feeds precise, mask-level visual features into the LLM. It adopts a convolutional CLIP backbone as the vision encoder, which handles high-resolution inputs more efficiently than ViT-based alternatives. To train this capability, the authors meticulously curate Osprey-724K, a substantial dataset of 724K mask-based region-text samples, which is pivotal to extending MLLMs toward pixel-level instructions.
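A minimal sketch of the mask-pooling step at the heart of such an extractor is shown below, assuming dense feature maps from a convolutional encoder; the mask_pool helper and the tensor shapes are illustrative assumptions, not the repository's actual implementation:

```python
import torch
import torch.nn.functional as F

def mask_pool(feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average-pool a dense feature map inside a binary region mask.

    feat: [C, H, W] feature map from a convolutional vision encoder.
    mask: [H_img, W_img] binary mask (1 inside the region, 0 outside).
    Returns a [C] region embedding.
    """
    # Resize the mask to the feature-map resolution.
    m = F.interpolate(mask[None, None].float(), size=feat.shape[-2:], mode="nearest")[0, 0]
    denom = m.sum().clamp(min=1.0)                    # guard against empty or tiny masks
    return (feat * m).flatten(1).sum(dim=1) / denom   # -> [C] masked average
```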

The model operates by injecting pixel-level representations of the masks into an LLM after processing multi-level features through mask pooling and linear projections. This integration allows Osprey to provide detailed semantic interpretations, object attributes, and complex scene descriptions at both the part and object level.
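Building on the mask_pool sketch above, the following illustrates how pooled multi-level mask features might be projected to the LLM's embedding width and spliced into the instruction's token embeddings. RegionTokenizer, splice_region, and the region-placeholder convention are assumptions for illustration, not Osprey's actual code:

```python
import torch
import torch.nn as nn

class RegionTokenizer(nn.Module):
    """Turn pooled multi-level mask features into an LLM input embedding."""

    def __init__(self, feat_dims, llm_dim):
        super().__init__()
        # One linear projection per feature level, fused by a final linear layer.
        self.level_proj = nn.ModuleList([nn.Linear(d, llm_dim) for d in feat_dims])
        self.fuse = nn.Linear(llm_dim, llm_dim)

    def forward(self, pooled_feats):
        # pooled_feats[i]: [C_i] region vector from level i (see mask_pool above).
        z = sum(proj(f) for proj, f in zip(self.level_proj, pooled_feats))
        return self.fuse(z)  # [llm_dim] region token embedding

def splice_region(text_embeds, region_pos, region_token):
    # text_embeds: [T, llm_dim] embeddings of the tokenized instruction;
    # region_pos: index of a region placeholder token; region_token: [llm_dim].
    out = text_embeds.clone()
    out[region_pos] = region_token
    return out
```

Keeping the region features as ordinary input embeddings lets the LLM attend to them exactly like text tokens, which is the general pattern behind region-level instruction tuning.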

Experimental Validation

The efficacy of Osprey is demonstrated through extensive experimental tasks, including:

  • Open-Vocabulary Segmentation: Osprey shows a substantial performance increase over existing models such as GPT4RoI and Ferret, demonstrating superior pixel-level segmentation and recognition.
  • Referring Object Classification: The model significantly outperforms existing methods on both LVIS and PACO datasets, showcasing its proficiency in identifying and describing nuanced details of object parts and categories.
  • Description and Reasoning Tasks: When evaluated on the Ferret-Bench and detailed region description tasks, Osprey achieves high accuracy and surpasses state-of-the-art models in providing insightful and articulate responses.

The paper also examines the role of negative samples and short-form prompts in mitigating object hallucination, as evaluated on the POPE benchmark. Osprey's integration of diverse prompts and a robust dataset is validated by its competitive performance across these settings.
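To make the negative-sample idea concrete, the sketch below shows one way such short-form yes/no instruction pairs could be generated; the make_presence_question helper, the <region> placeholder, and the prompt template are illustrative assumptions rather than the paper's actual data pipeline:

```python
import random

def make_presence_question(region_label, all_categories, negative, rng):
    """Build a short-form yes/no question about a masked region.

    Positive samples ask about the region's true category; negative samples ask
    about a randomly drawn absent category, the kind of probe POPE-style
    evaluation uses to measure object hallucination.
    """
    if negative:
        candidates = [c for c in all_categories if c != region_label]
        asked, answer = rng.choice(candidates), "No"
    else:
        asked, answer = region_label, "Yes"
    question = f"Is the object in <region> a {asked}? Answer using a single word."
    return question, answer

# Example: one positive and one negative sample for a region labeled "dog".
rng = random.Random(0)
print(make_presence_question("dog", ["dog", "cat", "bicycle"], negative=False, rng=rng))
print(make_presence_question("dog", ["dog", "cat", "bicycle"], negative=True, rng=rng))
```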

Implications and Future Directions

The advancements presented by Osprey have significant implications for AI applications requiring detailed image comprehension, such as autonomous systems, detailed scene analysis, and improved human-computer interaction interfaces. The pixel-level alignments forged by Osprey could lead to more interactive and context-aware AI systems that adeptly handle complex visual data.

Future developments could explore the expansion of Osprey's capabilities to more complex datasets and broader application areas. Further improvements in model efficiency, perhaps by integrating more streamlined processing architectures, could also enhance the adaptability and scalability of this approach. Additionally, the exploration of real-time applications and integration with interactive media formats presents a promising avenue for leveraging Osprey’s advanced visual understanding capabilities.

Overall, Osprey contributes significantly to the field of multimodal AI by refining the granularity at which models can understand and interact with visual data, laying a foundation for more detailed and context-rich AI systems in the future.

References (56)
  1. Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022.
  2. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
  3. Language models are few-shot learners. In NeurIPS, pages 1877–1901, 2020.
  4. Position-enhanced visual instruction tuning for multimodal large language models. arXiv preprint arXiv:2308.13437, 2023a.
  5. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023b.
  6. Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR, pages 1971–1978, 2014.
  7. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
  8. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  9. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
  10. Vocabulary-free image classification. In NeurIPS, 2023.
  11. The cityscapes dataset for semantic urban scene understanding. In CVPR, pages 3213–3223, 2016.
  12. Instructblip: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2023.
  13. Open-vocabulary universal image segmentation with maskclip. In ICML, 2023.
  14. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  15. Lvis: A dataset for large vocabulary instance segmentation. In CVPR, pages 5356–5364, 2019.
  16. Partimagenet: A large, high-quality dataset of parts. In ECCV, pages 128–145. Springer, 2022.
  17. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  18. Segment anything in high quality. arXiv preprint arXiv:2306.01567, 2023.
  19. Segment anything. In ICCV, 2023.
  20. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017.
  21. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023a.
  22. Elevater: A benchmark and toolkit for evaluating language-augmented visual models. In NeurIPS, pages 9287–9301, 2022.
  23. Multimodal foundation models: From specialists to general-purpose assistants. arXiv preprint arXiv:2309.10020, 1, 2023b.
  24. Semantic-sam: Segment and recognize anything at any granularity. arXiv preprint arXiv:2307.04767, 2023c.
  25. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023d.
  26. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023e.
  27. Clip surgery for better explainability with enhancement in open-vocabulary tasks. arXiv preprint arXiv:2304.05653, 2023f.
  28. Microsoft coco: Common objects in context. In ECCV, pages 740–755. Springer, 2014.
  29. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023a.
  30. Visual instruction tuning. In NeurIPS, 2023b.
  31. A convnet for the 2020s. In CVPR, pages 11976–11986, 2022.
  32. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  33. Decoupled weight decay regularization. In ICLR, 2019.
  34. Generation and comprehension of unambiguous object descriptions. In CVPR, pages 11–20, 2016.
  35. Microsoft. DeepSpeed. https://www.deepspeed.ai/, 2023.
  36. OpenAI. Chatgpt. https://openai.com/blog/chatgpt, 2022.
  37. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
  38. Aims: All-inclusive multi-level segmentation. In NeurIPS, 2023a.
  39. High quality entity segmentation. In ICCV, pages 4047–4056, 2023b.
  40. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.
  41. Paco: Parts and attributes of common objects. In CVPR, pages 7141–7151, 2023.
  42. Glamm: Pixel grounding large multimodal model. arXiv preprint arXiv:2311.03356, 2023.
  43. Sentence-bert: Sentence embeddings using siamese bert-networks. In EMNLP, 2019.
  44. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  45. Hierarchical open-vocabulary universal image segmentation. In NeurIPS, 2023.
  46. Grit: A generative region-to-text transformer for object understanding. arXiv preprint arXiv:2212.00280, 2022.
  47. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In CVPR, pages 2955–2966, 2023.
  48. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
  49. Ferret: Refer and ground anything anywhere at any granularity. arXiv preprint arXiv:2310.07704, 2023.
  50. Modeling context in referring expressions. In ECCV, pages 69–85. Springer, 2016.
  51. Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip. In NeurIPS, 2023.
  52. From recognition to cognition: Visual commonsense reasoning. In CVPR, pages 6720–6731, 2019.
  53. Gpt4roi: Instruction tuning large language model on region-of-interest. arXiv preprint arXiv:2307.03601, 2023.
  54. Scene parsing through ade20k dataset. In CVPR, pages 633–641, 2017.
  55. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
  56. Segment everything everywhere all at once. In NeurIPS, 2023.
Authors (8)
  1. Yuqian Yuan (10 papers)
  2. Wentong Li (25 papers)
  3. Jian Liu (404 papers)
  4. Dongqi Tang (9 papers)
  5. Xinjie Luo (1 paper)
  6. Chi Qin (2 papers)
  7. Lei Zhang (1689 papers)
  8. Jianke Zhu (68 papers)
Citations (47)